Great writeup. Should be in any operations handbook. One of the challenges I've ...

orangesareok · on Oct 14, 2014

This is definitely the biggest problem where I'm working now. We have a lot of monitoring via Wily Introscope, but the biggest thing is relating failures of different components together. E.g. one service layer fails, so some queue gets backed up, so some application server starts timing out.

The amount of noise that starts coming in when there is some major outage (say some mainframe system fails) is ridiculous.
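
A minimal sketch of one way that kind of noise could be collapsed, assuming a dependency map between components is available. The component names and the `DEPENDS_ON` map here are hypothetical, and this is not anything Wily Introscope provides. It only reports the alerting components that have no alerting dependency upstream of them:

```python
# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "app_server": ["queue"],
    "queue": ["service_layer"],
    "service_layer": ["mainframe"],
    "mainframe": [],
}


def probable_root_causes(alerting, depends_on):
    """Return alerting components with no alerting (transitive) dependency."""

    def has_alerting_upstream(component, seen):
        # Walk the dependencies of `component`, skipping ones already visited
        # so that a cyclic map cannot cause infinite recursion.
        for dep in depends_on.get(component, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_upstream(dep, seen):
                return True
        return False

    return sorted(c for c in alerting if not has_alerting_upstream(c, set()))


# The cascade from the example: service layer fails, queue backs up,
# application server starts timing out.
alerts = {"app_server", "queue", "service_layer"}
roots = probable_root_causes(alerts, DEPENDS_ON)
print(roots)
```

The hard part is still building and maintaining the dependency map itself; the filtering is cheap once that exists.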

Right now where I work they solve it by throwing manpower at the problem tbh.

It takes a lot of work by the application owners, all working together, to really get a coherent picture of how the services are interdependent. But the applications are so large, with old code, etc. (normal problems I guess a lot of companies face) that it's almost impossible to find people who have a complete end-to-end understanding of most transactions.

Side note: my only monitoring experience is with Wily - anyone have opinions to share on it?