"If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical."
This is an excellent point that is missed in most monitoring setups I've seen. A classic example is some request that kills your service process. You get paged for that, so you wrap the service in a supervisor-like daemon. The immediate issue is fixed, but from then on any future causes of the service process dying are hidden unless someone happens to be looking at the logs one day.
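For illustration, here's a rough sketch of a restarter that doesn't hide crashes: it restarts the service, but it also counts how often it's doing so and tells a human when the rate gets bad. The window, threshold, and notify() hook are all made up for the example:

    import subprocess
    import time
    from collections import deque

    CRASH_WINDOW_SECS = 3600   # look at crashes within the last hour
    CRASH_ALERT_COUNT = 5      # more than this per window => tell a human

    def notify(message: str) -> None:
        # Stand-in for "email the team" / "file a ticket" -- not a real API.
        print(f"[subcritical alert] {message}")

    def supervise(cmd: list[str]) -> None:
        crash_times: deque[float] = deque()
        while True:
            exit_code = subprocess.call(cmd)   # blocks until the process dies
            now = time.time()
            crash_times.append(now)
            # Forget crashes that fell out of the window.
            while crash_times and now - crash_times[0] > CRASH_WINDOW_SECS:
                crash_times.popleft()
            if len(crash_times) > CRASH_ALERT_COUNT:
                notify(f"{cmd[0]} crashed {len(crash_times)} times in the last hour "
                       f"(last exit code {exit_code}); the restarter is masking a problem")
            time.sleep(1)   # brief backoff before restarting

    # supervise(["/usr/local/bin/my-service"])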
I would love to see smart ways to surface "this will be a problem soon" on alerting systems.
Re: "this will be a problem soon": metrics trending. Look for changes in your metrics to spot potential problems and plan for the future. This is done quite often in QA, for example, to look for issues between releases, and it works at both the macro and micro level for a continuously delivered service's metrics.
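To make that concrete, a toy version of release-to-release trending might look like the following. The metric names and the 10% threshold are my own invention, not anyone's real pipeline:

    from statistics import median

    def flag_regressions(baseline: dict[str, list[float]],
                         candidate: dict[str, list[float]],
                         threshold: float = 0.10) -> list[str]:
        """Return metrics whose median moved more than `threshold` (fractional)."""
        flagged = []
        for name, base_samples in baseline.items():
            if name not in candidate or not base_samples or not candidate[name]:
                continue
            base = median(base_samples)
            cand = median(candidate[name])
            if base and abs(cand - base) / base > threshold:
                flagged.append(f"{name}: {base:.1f} -> {cand:.1f}")
        return flagged

    # Example: request latency crept up ~25% between releases.
    print(flag_regressions({"latency_ms": [100, 105, 98]},
                           {"latency_ms": [128, 130, 124]}))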
I think anything that requires "spotting potential problems" is only a partial solution. I've never seen a compelling system that can look at all the metrics and (with reasonable precision and recall) spot and summarize changes that are actually problematic and surprising to humans. It's definitely a necessary part of observing what's going on, and of quickly eliminating hypotheses like "maybe we're out of CPU!"
The subcritical alerts I think of are more things like "Well, the database is _getting_ full, but it's not full yet." Or to borrow someone else's example, "we put in this daemon restarter when it was dying once a week; now it's dying every few minutes and we're only surviving because our proxy is masking the problem, but soon it's going to take the whole site down."
These subcritical alerts deserve better but different handling: they can almost always be delivered to a non-paging email address, either a relevant internal mailing list or a ticket queue, where they can be investigated during normal office hours.
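Something like this, as a sketch in Python; the Alert shape and the destinations are placeholders for whatever paging/ticketing setup you actually have:

    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        severity: str   # "critical" or "subcritical"
        body: str

    def route(alert: Alert) -> str:
        # Only genuinely critical alerts get to wake someone up.
        if alert.severity == "critical":
            return "pager:oncall"
        # Everything subcritical lands somewhere it can wait for office hours.
        return "email:team-alerts-list"   # or "ticket:ops-queue"

    print(route(Alert("db-disk-filling", "subcritical", "db01 disk 85% full")))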
The other useful tip I have is to put URLs to internal wikis and/or tickets in the alert body. We write documentation for these to a 3AM standard: if I can't understand it immediately after being woken up at 3AM, it's not clear or actionable enough.
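For example, a minimal alert template to that standard might look like the following (the field names and the runbook URL are placeholders, not a real system):

    def format_alert(name: str, detail: str, runbook: str) -> str:
        # Every alert says what's wrong, what to do first, and where the docs are.
        return (f"ALERT: {name}\n"
                f"What: {detail}\n"
                f"Do this first: see runbook\n"
                f"Runbook: {runbook}")

    print(format_alert(
        name="db-disk-filling",
        detail="db01 disk 85% full, trending toward 100% in ~36h",
        runbook="https://wiki.internal/runbooks/db-disk-full"))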
I think we're talking about the same thing. (Why does this keep happening?) Trending of metrics tells you whether an alert is useful or not. Has the database been getting full for over a month, or did it just start filling, and at the current rate of disk consumption it will be full in 90 minutes?
There is no reason anyone should ever run out of disk space if they alert on the trending rate of disk consumption [rather than the actual amount of disk space used]. And this applies to so, SO many things beyond simple resource exhaustion. Seeing the trends is useful for alerting, but it's also useful to humans who can review them weekly and plan for the future.
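Here's roughly what I mean, as a sketch; the capacity, sample window, and 72-hour threshold are arbitrary numbers I picked for the example:

    import time

    def hours_until_full(samples: list[tuple[float, float]], capacity: float) -> float | None:
        """Estimate the growth rate from the first and last (unix_seconds, bytes_used)
        samples; return projected hours until full, or None if usage is flat/shrinking."""
        if len(samples) < 2:
            return None
        (t0, u0), (t1, u1) = samples[0], samples[-1]
        rate = (u1 - u0) / (t1 - t0)        # bytes per second
        if rate <= 0:
            return None
        return (capacity - u1) / rate / 3600

    # Example: 100 GB disk gaining ~1 GB/hour -> surfaced well before it's full.
    now = time.time()
    samples = [(now - 7200, 60e9), (now - 3600, 61e9), (now, 62e9)]
    eta = hours_until_full(samples, capacity=100e9)
    if eta is not None and eta < 72:
        print(f"disk projected full in {eta:.0f}h -- file a ticket now")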
I don't want to threadjack, but you don't have an email in your profile.
If you (or anyone else reading) ever want to talk about what "this will be a problem soon" might look like in the future drop me a line: [email protected]