Hacker News

I worked on a tier 1 service at Amazon with over 1,500 hosts in the fleet. Logs for the current hour lived on the host, and we would literally just grep through them. To grep logs on a subset of the fleet, or the entire fleet, there was a simple utility that would ssh into multiple prod hosts and run grep in parallel. Logs older than an hour were gzipped into storage, and we zgrep'd through them. No centralized logging (unless you count logs being rotated off the host at the end of the hour), and definitely no indexing.
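A minimal sketch of that kind of fleet-wide grep utility (the function name, host-list file, and log path here are illustrative assumptions, not the actual Amazon tool): ssh into each host, run the same grep, tag each line with its host, and let the greps run in parallel.

```shell
# Hypothetical sketch of a parallel fleet grep, NOT the real utility.
# usage: fleet_grep 'ERROR' /var/log/service/current.log hosts.txt
fleet_grep() {
    pattern=$1; logfile=$2; hostsfile=$3
    while read -r host; do
        # One ssh per host, backgrounded so all the greps run concurrently;
        # sed prefixes each matching line with the host it came from.
        ssh "$host" "grep '$pattern' '$logfile'" | sed "s/^/$host: /" &
    done < "$hostsfile"
    wait   # block until every host's grep has finished
}
```

Merging stdout and letting `wait` gate completion keeps the tool to a dozen lines, which is roughly why this approach stays maintainable at fleet scale.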

Never had any problems managing outages.

Grep alone can go a long way in managing outages.



Thanks, this is a great insight. I hadn't thought about it this way; it sounds like a very pragmatic approach that avoids overcomplicating things.


Interesting. I worked at Amazon on a large CDO product with hundreds or maybe low thousands of services. Some services probably had a couple instances, others had hundreds. There was a customized ELK stack that was indispensable, in my opinion, to tracking problems and communicating them (e.g. here's a URL that anyone on the org can look at). I'm trying to imagine distributed grep working as a solution for that org. Maybe it could work, but the large number of varied fleets owned by dozens of teams makes it a bit of a different problem.


Did your org stand up its own ELK stack? AFAIK when I worked there, the only centralized company-wide log solution available was RTLA, and RTLA was designed for detecting and alerting on fatals/errors/exceptions in logs, not as a general-purpose log analysis tool like Splunk Enterprise or the ELK stack.

I could see orgs standing up their own solutions like ELK, but our service didn't have to. We just relied on grepping logs stored in Timber for anything older than an hour, and grepping logs on prod hosts for real-time searching during outages. Granted, our service did not have many dependent services, but AFAIK the retail website, which has tons of dependencies, also followed a similar model (along with using RTLA for fatals), at least at that time (about 3 years ago).
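The older-than-an-hour half of that workflow can be sketched in a couple of lines: rotated logs land in storage as gzip files, and zgrep searches them without an explicit decompression step. The directory and file names below are made up for illustration.

```shell
# Simulate an hourly rotation target: one gzipped log per service-hour.
# (Paths and filenames are illustrative, not a real layout.)
mkdir -p /tmp/archived-logs
printf '12:00 request ok\n12:01 ERROR upstream timeout\n' \
    | gzip > /tmp/archived-logs/service.2024-01-01-12.log.gz

# zgrep decompresses on the fly; -h suppresses the filename prefix.
zgrep -h ERROR /tmp/archived-logs/*.log.gz
```

Because the search runs wherever the archive lives rather than on the prod hosts, it never competes with serving traffic.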


Did you ever worry about those greps interfering with production performance?


Not really. We could only grep logs on prod hosts for the current hour, so that only happened when there was a critical issue going on. Being limited to one hour also meant the logs were a few hundred MB tops. Even my old 2014-era ThinkPad could handle that workload without breaking a sweat. It didn't even cause a blip in our metrics. Grep is a well-written tool.

All grepping of data older than the current hour happened off the prod hosts, and thus was never a concern.


Do you ever worry that your log-exporting agent interferes with service performance? After all, that goes on continuously in a setup like the one in the article, rather than as-needed in a distributed predicate evaluation setup.


A constant stable performance drag is going to show up in your load tests, capacity planning, etc. rather than in surprising intermittent degradations.

Also our log collector agents run in containers like everything else, so there is some amount of resource isolation (not perfect of course).


All the resources are at the edges of your infrastructure, so distributed grep makes more and more sense the larger your installation becomes. Centralized logging looks less and less economical as things expand.



