asciifree's comments

I strongly believe whoever is on-call should have free rein to modify anything about their ops environment. The pitch for management is that while it might take time away from project work in the short term, the eventual reduction in interruptions will result in higher productivity.

p.s. Knew I recognised the name - loved following development of Planimeter/Grid a while ago!


Thank you so much! My company is shipping a finance platform at the moment, and I’d love to get back to Planimeter’s work when I am able.


> When I started people were paid for any hours they worked on-call

I've yet to hear of any alternative compensation model that actually works. Just pay people in their choice of money or time off in lieu. Sorry to hear you got screwed.

> Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority.

100% agree, I think people are far too tolerant of being paged. Especially management - the productivity impact of constant interrupts is huge. In a previous job one of my favourite things to do was go out to teams and just disable alerts they said were noisy or unactionable. If there was any pushback/consequence I was happy to accept responsibility (but never had to).


Disabling non-actionable alerts actually lowered the error rate in my experience, because people would start paying attention to the alerts. Even if they were being lazy, they'd be able to see a pattern after getting rid of the noise.


Exactly! Cut the noise, boost the signal. Every alert outside business hours should mean "drop everything and investigate this". Otherwise it can wait until the morning.
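To make that rule concrete, here's a rough sketch of the routing logic I have in mind. Everything in it is hypothetical: the `Alert` fields, the 9-to-5 weekday window, and the `route` outcomes are assumptions for illustration, not any real paging tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours: weekdays, 09:00-17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    name: str
    actionable: bool  # can a human actually do something about it?
    urgent: bool      # does it need attention right now?


def is_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Decide what to do with an alert (hypothetical outcomes)."""
    if not alert.actionable:
        return "drop"                  # noise: nobody should ever see it
    if alert.urgent:
        return "page"                  # drop everything and investigate
    if is_business_hours(now):
        return "ticket"                # handle during the normal working day
    return "queue_until_morning"       # it can wait
```

The point is that "page" is only reachable for alerts that are both actionable and urgent, so a page outside business hours always means something.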


I think we somewhat agree. Uncompensated on-call is not acceptable. Even if you're not busy, knowing you could be interrupted at any moment is an ever-present burden that takes a toll on your personal time.

But as long as the expected cost of downtime outweighs the financial cost of keeping someone available to fix it, on-call in some form will be inevitable. (There are a lot of instances where the cost doesn't make sense, and we should just accept the system being broken until 9am.)
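As a back-of-the-envelope sketch of that comparison (the function name, parameters, and all numbers are made up for illustration; real inputs would be estimates):

```python
def on_call_worth_it(
    incidents_per_month: float,
    hours_saved_per_incident: float,
    downtime_cost_per_hour: float,
    on_call_cost_per_month: float,
) -> bool:
    """Compare expected downtime avoided against the cost of coverage.

    hours_saved_per_incident is how much sooner an incident gets fixed
    with someone on-call versus waiting until the morning.
    """
    expected_saving = (
        incidents_per_month * hours_saved_per_incident * downtime_cost_per_hour
    )
    return expected_saving > on_call_cost_per_month


# Hypothetical revenue-critical service: 2 incidents a month, each fixed
# 6 hours sooner, at 1000/hour of downtime, versus 3000/month for coverage.
print(on_call_worth_it(2, 6, 1000, 3000))

# Hypothetical internal tool: rare incidents and cheap downtime.
print(on_call_worth_it(0.5, 4, 100, 3000))
```

With those made-up numbers, the first case saves an expected 12000 a month against 3000 of cost, so on-call pays for itself; the second saves only 200, which is the "just wait until 9am" case.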

I don't think on-call needs to suck though. IMO "staffing issues" (whether it's headcount, time, competing priorities, etc) are resourcing issues and I believe better tooling can absolutely help with that - either by reducing the resources required to fix it or by making the cost of the issues quantifiable. Thanks for the good luck :)


Operational maturity - small-to-medium sized businesses will often treat things like their oncall as an afterthought as they grow, leading to every team doing things differently & chronically burnt out individual "heroes" taking on too much of the burden.

Large corporations are able to allocate dedicated FTEs to tackle this and standardise their processes around best practices. That includes things like staffing "follow-the-sun" rotations, which just aren't feasible at smaller scales.

I'm building Rezible (https://github.com/rezible/rezible) to address this. The aim is to provide an "oncall on rails" experience for teams, with best practices encoded into the product for engineers to follow.


Forgot to add - I've put together a website which I will add more info & documentation to soon: https://rezible.com

