Hacker News

I worked on a tier 1 service at Amazon with over 1,500 hosts in the fleet. Logs for the current hour lived on the host, and we would literally just grep through them. To grep logs on a subset of the fleet, or the entire fleet, there was a simple utility that would ssh into multiple prod hosts and run grep in parallel. Logs older than an hour were gzipped into storage, and we zgrep'd through them. No centralized logging (unless you count logs being rotated off the host at the end of the hour), and definitely no indexing.
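A minimal sketch of that kind of fleet-wide grep utility (the function name, host-list file, and log path here are illustrative assumptions, not the actual Amazon tool): ssh into each host, run the same grep, tag each line with its host, and let the greps run in parallel.

```shell
# Hypothetical sketch of a parallel fleet grep, NOT the real utility.
# usage: fleet_grep 'ERROR' /var/log/service/current.log hosts.txt
fleet_grep() {
    pattern=$1; logfile=$2; hostsfile=$3
    while read -r host; do
        # One ssh per host, backgrounded so all the greps run concurrently;
        # sed prefixes each matching line with the host it came from.
        ssh "$host" "grep '$pattern' '$logfile'" | sed "s/^/$host: /" &
    done < "$hostsfile"
    wait   # block until every host's grep has finished
}
```

Merging stdout and letting `wait` gate completion keeps the tool to a dozen lines, which is roughly why this approach stays maintainable at fleet scale.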

Never had any problems managing outages.

Grep alone can go a long way in managing outages.



Thanks, this is a great insight. I hadn't thought about it this way; it sounds like a very pragmatic approach that avoids overcomplicating things.


Interesting. I worked at Amazon on a large CDO product with hundreds or maybe low thousands of services. Some services probably had a couple instances, others had hundreds. There was a customized ELK stack that was indispensable, in my opinion, to tracking problems and communicating them (e.g. here's a URL that anyone on the org can look at). I'm trying to imagine distributed grep working as a solution for that org. Maybe it could work, but the large number of varied fleets owned by dozens of teams makes it a bit of a different problem.


Did your org stand up its own ELK stack? AFAIK when I worked there, the only centralized company-wide log solution available was RTLA, and RTLA was designed for detecting and alerting on fatals/errors/exceptions in logs, not as a general-purpose log analysis tool like Splunk Enterprise or the ELK stack.

I could see orgs standing up their own solutions like ELK, but our service didn't have to. We just relied on grepping logs stored in Timber for anything older than an hour, and grepping logs on prod hosts for real-time searching during outages. Granted, our service did not have many dependent services, but AFAIK the retail website, which has tons of dependencies, also followed a similar model (along with using RTLA for fatals), at least at that time (about 3 years ago).
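The older-than-an-hour half of that workflow can be sketched in a couple of lines: rotated logs land in storage as gzip files, and zgrep searches them without an explicit decompression step. The directory and file names below are made up for illustration.

```shell
# Simulate an hourly rotation target: one gzipped log per service-hour.
# (Paths and filenames are illustrative, not a real layout.)
mkdir -p /tmp/archived-logs
printf '12:00 request ok\n12:01 ERROR upstream timeout\n' \
    | gzip > /tmp/archived-logs/service.2024-01-01-12.log.gz

# zgrep decompresses on the fly; -h suppresses the filename prefix.
zgrep -h ERROR /tmp/archived-logs/*.log.gz
```

Because the search runs wherever the archive lives rather than on the prod hosts, it never competes with serving traffic.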


Did you ever worry about those greps interfering with production performance?


Not really. We could only grep logs on prod hosts for the current hour, so that only happened when there was a critical issue going on. Being limited to one hour also meant the logs were a few hundred MB tops. Even my old 2014-era ThinkPad could handle that workload without breaking a sweat. It didn't even cause a blip in our metrics. Grep is a well-written tool.

All grepping of data older than the current hour happened off the prod hosts, and thus was never a concern.


Do you ever worry that your log-exporting agent interferes with service performance? After all, that goes on continuously in a setup like the one in the article, rather than as-needed in a distributed predicate evaluation setup.


A constant stable performance drag is going to show up in your load tests, capacity planning, etc. rather than in surprising intermittent degradations.

Also our log collector agents run in containers like everything else, so there is some amount of resource isolation (not perfect of course).


All the resources are at the edges of your infrastructure, so distributed grep makes more and more sense the larger your installation becomes. Centralized logging looks less and less economical as things expand.



