"If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical."
This is an excellent point that is missed in most monitoring setups I've seen. A classic example is some request that kills your service process. You get paged for that, so you wrap the service in a supervisor-like daemon. The immediate issue is fixed, but from then on any future causes of the service process dying are hidden unless someone happens to be looking at the logs one day.
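For illustration, here's a rough sketch of a restarter that doesn't hide crashes: it restarts the service, but it also counts how often it's doing so and tells a human when the rate gets bad. The window, threshold, and notify() hook are all made up for the example:

    import subprocess
    import time
    from collections import deque

    CRASH_WINDOW_SECS = 3600   # look at crashes within the last hour
    CRASH_ALERT_COUNT = 5      # more than this per window => tell a human

    def notify(message: str) -> None:
        # Stand-in for "email the team" / "file a ticket" -- not a real API.
        print(f"[subcritical alert] {message}")

    def supervise(cmd: list[str]) -> None:
        crash_times: deque[float] = deque()
        while True:
            exit_code = subprocess.call(cmd)   # blocks until the process dies
            now = time.time()
            crash_times.append(now)
            # Forget crashes that fell out of the window.
            while crash_times and now - crash_times[0] > CRASH_WINDOW_SECS:
                crash_times.popleft()
            if len(crash_times) > CRASH_ALERT_COUNT:
                notify(f"{cmd[0]} crashed {len(crash_times)} times in the last hour "
                       f"(last exit code {exit_code}); the restarter is masking a problem")
            time.sleep(1)   # brief backoff before restarting

    # supervise(["/usr/local/bin/my-service"])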
I would love to see smart ways to surface "this will be a problem soon" on alerting systems.
Re: "this will be a problem soon": metrics trending. Look for changes in your metrics to spot potential problems and plan for the future. This is done quite often in QA, for example, to look for issues between releases, and it works at both the macro and micro level for a continuously delivered service's metrics.
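To make that concrete, a toy version of release-to-release trending might look like the following. The metric names and the 10% threshold are my own invention, not anyone's real pipeline:

    from statistics import median

    def flag_regressions(baseline: dict[str, list[float]],
                         candidate: dict[str, list[float]],
                         threshold: float = 0.10) -> list[str]:
        """Return metrics whose median moved more than `threshold` (fractional)."""
        flagged = []
        for name, base_samples in baseline.items():
            if name not in candidate or not base_samples or not candidate[name]:
                continue
            base = median(base_samples)
            cand = median(candidate[name])
            if base and abs(cand - base) / base > threshold:
                flagged.append(f"{name}: {base:.1f} -> {cand:.1f}")
        return flagged

    # Example: request latency crept up ~25% between releases.
    print(flag_regressions({"latency_ms": [100, 105, 98]},
                           {"latency_ms": [128, 130, 124]}))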
I think anything that requires "spotting potential problems" is only a partial solution. I've never seen a compelling system that can look at all the metrics and (with reasonable precision and recall) spot and summarize changes that are actually problematic and surprising to humans. It's definitely a necessary part of observing what's going on, and of quickly eliminating hypotheses like "maybe we're out of CPU!"
The subcritical alerts I think of are more things like "Well, the database is _getting_ full, but it's not full yet." Or to borrow someone else's example, "we put in this daemon restarter when it was dying once a week; now it's dying every few minutes and we're only surviving because our proxy is masking the problem, but soon it's going to take the whole site down."
These subcritical alerts deserve better but different handling: they can almost always be delivered to a non-paging email address, either a relevant internal mailing list or a ticket queue, where they can be investigated during normal office hours.
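Something like this, as a sketch in Python; the Alert shape and the destinations are placeholders for whatever paging/ticketing setup you actually have:

    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        severity: str   # "critical" or "subcritical"
        body: str

    def route(alert: Alert) -> str:
        # Only genuinely critical alerts get to wake someone up.
        if alert.severity == "critical":
            return "pager:oncall"
        # Everything subcritical lands somewhere it can wait for office hours.
        return "email:team-alerts-list"   # or "ticket:ops-queue"

    print(route(Alert("db-disk-filling", "subcritical", "db01 disk 85% full")))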
The other useful tip I have is to put URLs to internal wikis and/or tickets in the alert body. We write documentation for these to a 3AM standard: if I can't understand it immediately after being woken up at 3AM, it's not clear or actionable enough.
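For example, a minimal alert template to that standard might look like the following (the field names and the runbook URL are placeholders, not a real system):

    def format_alert(name: str, detail: str, runbook: str) -> str:
        # Every alert says what's wrong, what to do first, and where the docs are.
        return (f"ALERT: {name}\n"
                f"What: {detail}\n"
                f"Do this first: see runbook\n"
                f"Runbook: {runbook}")

    print(format_alert(
        name="db-disk-filling",
        detail="db01 disk 85% full, trending toward 100% in ~36h",
        runbook="https://wiki.internal/runbooks/db-disk-full"))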
I think we're talking about the same thing. (Why does this keep happening?) Trending of metrics tells you whether an alert is useful or not. Has the database been getting full for over a month, or did it just start filling, and at the current rate of disk consumption it will be full in 90 minutes?
There is no reason anyone should ever run out of disk space if they alert on the trending rate of disk consumption [rather than the actual amount of disk space used]. And this applies to so, SO many things beyond simple resource exhaustion. Seeing the trends is useful for alerting, but it's also useful to humans who can review them weekly and plan for the future.
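Here's roughly what I mean, as a sketch; the capacity, sample window, and 72-hour threshold are arbitrary numbers I picked for the example:

    import time

    def hours_until_full(samples: list[tuple[float, float]], capacity: float) -> float | None:
        """Estimate the growth rate from the first and last (unix_seconds, bytes_used)
        samples; return projected hours until full, or None if usage is flat/shrinking."""
        if len(samples) < 2:
            return None
        (t0, u0), (t1, u1) = samples[0], samples[-1]
        rate = (u1 - u0) / (t1 - t0)        # bytes per second
        if rate <= 0:
            return None
        return (capacity - u1) / rate / 3600

    # Example: 100 GB disk gaining ~1 GB/hour -> surfaced well before it's full.
    now = time.time()
    samples = [(now - 7200, 60e9), (now - 3600, 61e9), (now, 62e9)]
    eta = hours_until_full(samples, capacity=100e9)
    if eta is not None and eta < 72:
        print(f"disk projected full in {eta:.0f}h -- file a ticket now")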
I don't want to threadjack, but you don't have an email in your profile.
If you (or anyone else reading) ever want to talk about what "this will be a problem soon" might look like in the future drop me a line: [email protected]