Alerting and monitoring is not about logs. Applications export interesting signa...

dchichkov · on Oct 13, 2014

Well, to me "applications export interesting signals directly in a way understood by monitoring service" feels like a legacy approach. It places the burden of decision "what is an interesting alert signal" and burden of structuring the log file output on the software developer! And it places that burden at an inconvinient time, when the system is still in the making.

On the other hand, by logging everything as text, and then running intellegent/structurizing real time search engine over logs one can make/modify these decisions at a later time. And it can be done both by devs/ops, without touching the source code!

thrownaway2424 · on Oct 13, 2014

That seems silly though. I can replace stats on a thing that normally takes 50 usecs in a log line because it will take more than that long just to log the fact and an insane amount of cpu to analyze such a thing. The large scale systems that I personally operate produce a few KB per minute in structured stats, a few MB per second in structured logs, and hundreds of MB per second in unstructured text logs. I know which of these I'd rather use for monitoring.

dchichkov · on Oct 13, 2014

To thrownaway2424. What seems silly is that processing of a few of MB per second of unstructured text logs by a real time search engine seems impossible to you. Think web-crawlers. Search engines are efficient....

thrownaway2424 · on Oct 14, 2014

What do you use to monitoring the "real time search engine"?

dchichkov · on Oct 14, 2014

Is that a joke-question? The one that I've used is the elasticsearch / kibana. And usually one would be using elasticsearch to monitor the elasticsearch :)

That's the good thing about this setup, you have all the logs from all your applications (think like custom text logs from your routers, your custom applications, temperature sensors, syslogs, windows servers) aggregated in one place. And when something happens (at a particular moment in time, or with a particular machine, or with a particular key) suddenly you are able to search/drill down and locate the actual cause. And maybe even configure a dashboard or make a plot that would show when this problem was showing up.

Scalable real time search engines with the ability to create trends/dashboards is one powerfull toy ;) It is ridiculuos and silly. But it is an immensely powerfull approach.

donavanm · on Oct 14, 2014

youre thinking too small. Try hundreds of KB to a couple MB per second per host. And tens of thousands of hosts. Data streams at (tens of) gigabits per second are not trivial.

dchichkov · on Oct 14, 2014

I don't know. In my experience, one big elasticsearch box can cope with a few months of 2-3 MB/sec log data. I guess that the entropy of log file information is quite low and the search engine is being able to take advantage of that and keep its indexes rather small. But gigabits per second... I just don't know.

twic · on Oct 13, 2014

You can do alerting and monitoring through logs. I've done it myself. You can reduce the complexity of your infrastructure by converging those functions into a single set of tools. I would absolutely agree that the state of the art is capturing the evolving state of the system as a stream of events, and deriving monitoring and alerting from that stream.

apposite · on Oct 14, 2014

Logs are typically fairly unstructured and complex to parse. For whitebox monitoring (i.e. where you have access to the code and the code can report state) you are far better off exporting state in a very well defined format to minimise parsing overhead. It also tends to make you a bit more focused on defining the characteristics of the parameter you are monitoring.

You want blackbox monitoring (for close-to-user experience) AND whitebox monitoring (which provide diagnostics of internal state for debugging). True blackbox monitoring is often pretty unreliable so you are usually better of alerting on whitebox reported state of end user perceivable variables, e.g. HTTP error codes, latency and so on.

State of the art is to report a staggering amount of data about the internal state of a server. I mean a lot. 10s to 100s of times the number of parameters you are probably used to seeing.

Rapzid · on Oct 14, 2014

Typically yes. But I'm predicting a huge shift towards structured logging. Take a look at Serilog. It's a .Net logger I've been using recently that just has some fantastic concepts. Worth reading into even if you don't use .Net . I believe that's the direction "logging" will go... It's more eventing now I guess.

Both metrics and "logs" can be expressed as events. Those are like points in space. An incident could be like a line; a 2d event with a start and duration.