Great writeup. It should be in any operations handbook. One of the challenges I've found is dynamic urgency: something is urgent when it first comes up, but once it's known and being addressed it isn't urgent anymore, unless something else is going on that we don't know about.
For example, you get a server failure that affects a service, and you begin replacing that server with a backup. But a switch is also dropping packets, so you are getting alerts on degraded service (the symptom) and believe you are fixing the cause (the down server), when in fact you will still have a problem after the server is restored. So my challenge is figuring out how to alert on that additional input in a way that folks won't just say "oh yeah, this service, we're working on it already."
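One way to picture the problem is to record, when an incident is acknowledged, which signals the known cause is expected to explain, and to re-escalate anything outside that set. This is only a rough sketch: the `Incident` record, the `triage` function, and the service and signal names are all made up for illustration and don't come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """An acknowledged incident (hypothetical structure)."""
    service: str
    known_cause: str
    # Signals the known cause is expected to produce while it is being fixed.
    explained_signals: frozenset = frozenset()


def triage(incident, alert_service, alert_signal):
    """Return 'suppress' if the alert is explained by the known cause, else 'escalate'."""
    if alert_service != incident.service:
        return "escalate"
    if alert_signal in incident.explained_signals:
        return "suppress"
    # Same service, but a signal the known cause does not account for:
    # this is the "something else going on" case.
    return "escalate"


incident = Incident(
    service="checkout",
    known_cause="server web-03 down",
    explained_signals=frozenset({"host_unreachable", "reduced_capacity"}),
)

print(triage(incident, "checkout", "reduced_capacity"))  # suppress
print(triage(incident, "checkout", "packet_loss"))       # escalate
```

The hard part in practice is deciding what goes into `explained_signals`, which is a judgment call by whoever acknowledges the incident.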
This is definitely the biggest problem where I'm working now. We have a lot of monitoring via Wily Introscope, but the hardest part is relating failures of different components to each other. E.g., one service layer fails, so some queue gets backed up, so some application server starts timing out.
The amount of noise that starts coming in when there is some major outage (say some mainframe system fails) is ridiculous.
Right now where I work they solve it by throwing manpower at the problem tbh.
It takes a lot of work by the application owners, all working together, to really get a coherent picture of how the services are interdependent. But the applications are so large, the code so old, etc. (normal problems I guess a lot of companies face) that it's almost impossible to find people who have a complete end-to-end understanding of most transactions.
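If even a partial dependency map existed, the noise could in principle be collapsed to the alerting components that have nothing alerting upstream of them. A minimal sketch, assuming a hand-maintained map; `DEPENDS_ON`, `upstream`, and `probable_roots` are hypothetical names, not anything Wily provides:

```python
# Hypothetical dependency map: each component lists what it directly depends on.
DEPENDS_ON = {
    "app_server": ["queue"],
    "queue": ["service_layer"],
    "service_layer": ["mainframe"],
    "mainframe": [],
    "switch": [],
}


def upstream(component, graph):
    """Return every component that `component` depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def probable_roots(alerting, graph=DEPENDS_ON):
    """Return alerting components with no alerting component upstream of them."""
    alerting = set(alerting)
    return sorted(c for c in alerting if not (upstream(c, graph) & alerting))


print(probable_roots(["app_server", "queue", "service_layer"]))  # ['service_layer']
print(probable_roots(["app_server", "switch"]))                  # ['app_server', 'switch']
```

The second call shows the case from the example above: two unrelated failures both surface as roots instead of one being hidden behind the other. Of course this only works as well as the map is accurate, which is exactly the part that's hard to get.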
Side note: my only monitoring experience is with Wily. Anyone have opinions to share on it?