> What do I need to monitor?

I think it's worth making those decisions both top-down -and- bottom-up.

Top-down as in "start from what makes the business a viable business and then analyze downwards from there" which gets you things like user-facing site/API availability, performance, error rates etc. (and maybe things like per-user usage rates depending on what you're doing and what your business model is).

Bottom-up as in "start from what makes the infrastructure exist at all and work up from there" which gets you things like disk usage, RAM, CPU, network link saturation - all the low level stuff that won't affect your top-down metrics until it does, at which point everything will catch fire at once.
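To make the two directions a bit more concrete, here's a rough sketch of how I'd picture the starting set of checks. Everything in it is made up for illustration: the metric names, the thresholds, and the `firing` helper are assumptions, not any particular monitoring tool's API, and real numbers would depend on your business and hardware.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    layer: str        # "top-down" (business-facing) or "bottom-up" (infrastructure)
    threshold: float  # alert when the observed value exceeds this


# Hypothetical starting set: the highest and lowest level checks only.
CHECKS = [
    Check("api_error_rate_pct", "top-down", 1.0),
    Check("api_p95_latency_ms", "top-down", 500.0),
    Check("disk_used_pct", "bottom-up", 85.0),
    Check("ram_used_pct", "bottom-up", 90.0),
    Check("cpu_used_pct", "bottom-up", 90.0),
    Check("net_link_saturation_pct", "bottom-up", 80.0),
]


def firing(checks, observed):
    """Group the names of checks whose observed value exceeds the threshold by layer."""
    result = {"top-down": [], "bottom-up": []}
    for check in checks:
        value = observed.get(check.name)
        # Metrics with no observation are skipped rather than treated as failures.
        if value is not None and value > check.threshold:
            result[check.layer].append(check.name)
    return result


# The disk is filling up, but users have not noticed anything yet.
sample = {
    "api_error_rate_pct": 0.2,
    "api_p95_latency_ms": 180.0,
    "disk_used_pct": 92.0,
    "ram_used_pct": 60.0,
    "cpu_used_pct": 35.0,
    "net_link_saturation_pct": 10.0,
}

alerts = firing(CHECKS, sample)
print(alerts)
```

With that sample only the bottom-up disk check fires while the top-down list stays empty, which is the "won't affect your top-down metrics until it does" window where you still have time to act.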

They'll hopefully meet somewhere in the middle in a way that makes sense, and you could perhaps argue that per-internal-service monitoring is a sort of middle-outwards approach. But I suspect the highest and lowest level checks are the most useful ones to start with. You then extend from there as you get a feel for what situations cause those to fire, and start monitoring the mid-range of the '5 whys' rather than just 1 and 5.

(I'm not sure I've made this as clear as I wanted but such is the peril of waxing philosophical about things)