For those who are smaller, don't have the money to pay for Splunk Enterprise, and don't have the headcount to build their own logging infrastructure: I built a service called GraphJSON that makes it super easy to log and process any type of data. You can read more about how and why I built it here: https://www.graphjson.com/guides/about
I was hoping TFA would break down their log-based observability strategy and go into things like trace ids, structured logs....
Instead I am disappointed to learn that engineering at Twitter still sounds.... suboptimal. This from the company that brought us the fail whale (I can't find the blog post now, but I recall Twitter's instability was due to a massive kludge that went unnoticed for years).
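(For reference, by "trace ids, structured logs" above I mean something like the minimal sketch below -- the field names, service name, and logging setup are all made up for illustration, not anything from the article.)

```python
import json
import logging
import time
import uuid

# Hypothetical sketch: emit one JSON object per log line so downstream tools can
# index fields instead of regexing free text. Field names here are invented.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "service": "timeline-api",
            "trace_id": getattr(record, "trace_id", None),  # propagated per request
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The trace id is generated at the edge and passed through every service call,
# so all the logs for one request can be joined later.
log.info("loaded home timeline", extra={"trace_id": str(uuid.uuid4())})
```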
Another nit: the "abandoned" part is the one that was open sourced a loooong time ago. Since then it has undergone a ton of changes, but yeah, those all live within Facebook. So, not abandoned at all, just not available outside.
The main issue (as usual in big companies) is the large amount of inter-dependencies with internal systems. Scribe as-is today doesn't make much sense outside Facebook. And yeah, it could be cleaned up, the internal services mocked out, and then OSS'd. But that's a lot of work, both doing it and maintaining it. And with all those mocks, etc., it "wouldn't be Scribe as-is" either, in terms of how it scales and so on.
In any case, the storage layer (LogDevice, https://github.com/facebookarchive/LogDevice), which is a large part of the system, was open sourced a while back and... sadly it too ended up as unmaintained OSS.
Finally, you can get a similar-ish system working with OSS tools (e.g. fluentd + Kafka) that will also scale quite well and which IMHO can also be made to scale to Facebook-size levels. So the incentive is there, but there are OSS alternatives already available :)
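To make the fluentd + Kafka idea a bit more concrete, here's a minimal sketch of the shipper half of such a pipeline, written against the kafka-python client rather than fluentd itself; the broker addresses, topic name, and log path are all made up:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical sketch: tail a JSON-per-line log file and ship each record to a
# Kafka topic -- roughly the job fluentd's Kafka output plugin does for you.
producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks=1,  # trade durability for throughput; use acks="all" if losing logs hurts
)

with open("/var/log/app/service.json.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than blocking the pipeline
        # Key by host so one host's logs stay ordered within a partition.
        producer.send("app-logs", key=event.get("host", "").encode(), value=event)

producer.flush()
```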
Well, the open source part of it is totally abandoned.
The internal improvements aren't interesting or available to anyone outside of facebook.
The open source version is creaky and very difficult to drag forward to newer versions of thrift.
So -- thanks Facebook for dropping weird artifacts like the aftermath of Roadside Picnic. Woe unto all who adopt these soon-to-be-orphaned technologies.
Just started my 2 week trial of GraphJSON a few days ago! I like it so far, kudos. Unrelated question: do you have some kind of notification setup for HN? I feel like I’ve seen you comment on many logging-related posts, so figured I’d ask =)
The stickiest part of a logging pipeline is the ingestion and message format (scribe in this case); those are the gravity and speed of light of your organization, embedded at the lowest layer of everything everywhere.
Three years from now they could / will write a new blog post "today we're open sourcing smackhouse, a log analytics platform based on smooshing logs into clickhouse! We ETL the data using this zany rust thing (zaptl) from kafka into clickhouse, and made some zippy UI on top. We still stream logs into hadoop and goolens and splunk; any teams that want to use these other infrastructures just set a config in a fargml setting to zaptl to request the log stream and sample rate."
They're not using all of splunk; they're only using it to solve one of the harder parts of logging -- log analytics. They're not using it to solve the stickiest part of logging.
Good for them. BTW -- this reads like a cross between a proof of life video and a newsletter written by the intern. Glad it's earning them a huge discount from splunk.
If you're not already using smackhouse to smoosh your logs into clickhouse, how can you even call yourself a developer, bro? Who do you code for? Standard Oil? Ma Bell?
It’s fascinating to see companies solve the same problems over and over again.
I hope in the near future these types of problems are “definitively solved” and more effort can be spent engineering novel features.
There’s no way to do this, but I’d be interested to see if there’s a company that just has an extremely simple stack that is very performant and popular.
I’m talking: their entire stack is Citus sharded over a few locations running two beefy servers per continent they have users in, that’s it. No dedicated search, logging or anything. Just dump everything in Postgres. If logs are lost then so be it.
A lot of the problems at scale are problems of our own creation. For example, is strong consistency really necessary for tweets? Do you really need to see a follower's latest tweet instantly?
Uncheck some boxes and the engineering becomes a lot simpler, very fast.
> For example, is strong consistency really necessary for tweets? Do you really need to see a follower's latest tweet instantly?
The problem is that "strong consistency" basically means your system behaves in predictable ways that you can easily reason about. If you give it up, you can see all kinds of failure modes that go beyond just seeing stale data.
Suppose a user first sets their profile to "private", meaning only their trusted friends can see their posts. And then, having seen that that operation succeeded, they make a post that contains personal, sensitive information. You had better be damn sure that no user-facing components of your system can ever observe those two events in the wrong order.
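To make that concrete, here's a hedged sketch (every name is invented, and this is not a claim about how Twitter does it) of one way to get that ordering without global strong consistency: the post records the profile version it was written after, and a replica refuses to serve it until its local copy of the profile has caught up.

```python
# Hypothetical sketch of causal ordering between a profile change and a post.
# "store" and "replica" are abstract stand-ins for whatever storage you use.

class StaleReplica(Exception):
    pass

def set_profile_private(store, user_id):
    # Returns a monotonically increasing version number for the profile record.
    return store.write(("profile", user_id), {"visibility": "private"})

def create_post(store, user_id, body, profile_version):
    # The post carries the profile version it causally depends on.
    store.write(("post", user_id), {
        "body": body,
        "requires_profile_version": profile_version,
    })

def render_post(replica, user_id, post, viewer):
    profile_version, profile = replica.read(("profile", user_id))
    if profile_version < post["requires_profile_version"]:
        # This replica hasn't seen the privacy change yet: don't serve the post
        # under stale visibility rules; retry elsewhere or wait for replication.
        raise StaleReplica(user_id)
    if profile["visibility"] == "private" and viewer not in profile.get("trusted", []):
        return None  # hidden from this viewer
    return post["body"]
```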
If you threw strong consistency out the window but kept low latency, you'd still get the "make profile private" event well before the user has navigated to the post page and composed a tweet.
The problem is ensuring that event has been propagated all around the world so you aren’t displaying now-private tweets. That’s why we need strong consistency.
I have spent much of my time at this company on this exact problem.
Still linearizable under the insane assumption that the db is always “low latency” which implies it never goes down. The write either happened or it didn’t.
One problem is that at a certain point you can’t just lose logs - for example, somebody might pay you per ad impression or click. Losing logs can mean losing evidence that those impressions/clicks happened. Logs aren’t just a debugging tool, they can be a critical legal CYA.
I’m not saying that people don’t overcomplicate projects (they do) but a lot of simplifying assumptions can break down at scale.
You can even handle impressions/clicks/conversions in a lossy system, you just have to bound how much loss you're willing to accept. The counterbalancing factor is how expensive you want your logging infrastructure to be -- the more consistency, the more it costs. The other trade off is availability: what do you do when the logging backend is temporarily down? Return a 503 to the user? Or accept that you're not making money on that impression and move on?
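As a toy illustration of "bound how much loss you're willing to accept" (everything below is made up): a fixed-size client-side buffer that keeps serving traffic when the logging backend is slow or down, and counts what it drops so the loss is at least measurable.

```python
import queue
import threading

# Hypothetical sketch: a non-blocking, bounded log buffer. When the logging
# backend is slow or down, we drop events and count them instead of failing
# the user's request with a 503.
class BoundedLogger:
    def __init__(self, sink, max_buffered=10_000):
        self._q = queue.Queue(maxsize=max_buffered)
        self._dropped = 0
        self._sink = sink  # e.g. a Kafka producer or HTTP client, injected
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, event):
        try:
            self._q.put_nowait(event)
        except queue.Full:
            self._dropped += 1  # bounded, measurable loss instead of back-pressure

    def _drain(self):
        while True:
            event = self._q.get()
            try:
                self._sink.send(event)
            except Exception:
                self._dropped += 1  # backend down: accept the loss and move on

    @property
    def dropped(self):
        return self._dropped
```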
True, logs do need to be stored somewhere. What I meant was: if the info is that important to the business, it should be stored in a system that provides stronger durability than just logging it.
Seems we are referring to two different things. What you mention refers to DB journal logs, and my comment was about application logs (debug, info, error messages, etc.)
Though I agree, I think the parent's point is that you don't necessarily know whether it's important, so you have to have some general record keeping with respect to "activity".
As you enter larger corporations, you see more Resume Driven Development. Overengineering isn't something to be avoided. It is the purpose of your 3 to 5 year stint there. Management rewards it so everyone participates in the circus together until one day the board fires 80% of engineering and hires contractors.
I'd suggest you look at Stack Overflow engineering posts. Their architecture is about as simple as you can get for a service their size. A bunch of web servers running plain old ASP.NET MVC code (recently upgraded to use ASP.NET Core). A handful of SQL server machines for the database, simple sharding strategies. Some Redis machines for caching.
Twitter's real-time aspect is a core value proposition compared to other social media platforms. Nothing tells you what is happening right this second as well as Twitter. It is faster than earthquake shock waves. And people rely on that speed for hundreds of millions of dollars in revenue.
This feels heavily overstated. Especially when you consider that the vast majority of "influencers" and such actually schedule their tweets.
I agree that current tweets are important, of course. But I also acknowledge that just as often I will be scrolling through the "In case you missed it" section of tweets. And I have no idea how rapidly those actually made it to my timeline.
Is it overstated? Given that my timeline constantly shifts under my scrolling, often showing me out-of-date crap, yes. Worse, many posts are ads or trending items where... no, you would not know if there were seconds of lag in them getting processed.
You could almost argue that the small like counts have to be strongly consistent. But I often get a notification of a like well before it shows on my timeline. So it is clearly not strongly consistent.
You have to take into account that the people that Twitter makes the most money off of for real time functionality are not integrated through the web interface.
The previous POTUS made important policy announcements on Twitter first. It would be very difficult to understand what was going on without fast, consistent tweets.
With slow, unreliable tweets he might have had to be filtered through normal channels, which would have meant they came out at more predictable times like “press conferences” after being discussed by the people involved in that whole process, instead of “I watched Fox and Friends and now I’m pissed off about whatever they wanted me to be angry about today”. It would have been much easier to understand things, that shit was constant chaos.
Trump announced a ban on transgender soldiers via tweet like 1 day after giving a heads up to the Defense Secretary who was on vacation at the time. Nobody else in the military was prepared for the policy change and they were left scrambling.
Multiple times he fired/hired senior members of his administration via twitter, including defense secretary, secretary of state, national security advisor, chief of staff, ...
> Uncheck some boxes and the engineering becomes a lot simpler
I think you're oversimplifying to the point of nonsense. In my domain, reliable logging is an absolute must due to regulations and the fact that we deal with people's money. Pretty sure we're not the only industry that can't afford to approach logging the way you painted it.
The blog post just cuts off at the end?
"Ultimately, this migration has resulted in increased adoption of centralized logging, including among core application teams and operations and"
Ok so apart from the article being kinda surprisingly bland -- I think the coolest part was how easily they were able to adopt a new architecture thanks to the decoupling they had between producers and consumers via Kafka. Pub/sub, fan-out, and messaging patterns are such great enablers. Seriously though, what's up with the lack of quality control on the post...
I am pretty sure Twitter gets a huge discount. A company like Twitter being on the Splunk platform is itself a huge advertisement to future, more traditional Fortune 500 customers.
We log >>100TB/day in splunk and get all the discounts. It’s still ridiculously expensive.
Many of the issues presented in the article ring very true. Splunk is pretty amazing for ad hoc analysis/threat hunting. However, once you know what you're looking for, the value proposition drops precipitously.
The whole architecture looks insanely expensive. Putting your debug logs through scribe, through Kafka, then into Splunk, and indexing them, has got to cost something like fifty dollars per gigabyte.
I've always been a proponent of leaving logs where they were produced, not collecting them, and not indexing them, so I find an architecture like this just shocking.
Centralized logging is one of the many problems you sign up for once you opt for microservices on a cluster scheduler. Since service instances are ephemeral and cluster nodes only slightly less so, it doesn’t really work to leave the logs in place.
When you have fleets of thousands of machines performing a given service, and thousands of services implementing your product, you simply must centralize and index logs if you want to have any chance of managing outages.
I worked on a tier 1 service at Amazon with over 1500 hosts in the fleet. Logs from the current hour would be on the host and we would literally just grep through them; to grep through logs on a subset of the fleet or the entire fleet, there was a simple utility that would ssh into multiple prod hosts and run grep in parallel. For anything older, the logs went gzipped into storage and we zgrep'd through them. No centralized logging (unless you count logs being rotated off the host at the end of the hour) and definitely no indexing.
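For flavor, that kind of utility is roughly in the spirit of this sketch (host list, log path, and concurrency are invented; the real thing was surely more robust):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of a fleet-wide grep: ssh into each prod host, grep the
# current hour's log, and prefix each match with the host it came from.
def grep_host(host, pattern, path="/var/log/service/service.log"):
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "grep", "--", pattern, path],
        capture_output=True, text=True, timeout=60,
    )
    return [f"{host}: {line}" for line in result.stdout.splitlines()]

def fleet_grep(hosts, pattern, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for matches in pool.map(lambda h: grep_host(h, pattern), hosts):
            for line in matches:
                print(line)

if __name__ == "__main__":
    fleet_grep(open("hosts.txt").read().split(), "RequestId=abc123")
```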
Interesting. I worked at Amazon on a large CDO product with hundreds or maybe low thousands of services. Some services probably had a couple instances, others had hundreds. There was a customized ELK stack that was indispensable, in my opinion, to tracking problems and communicating them (e.g. here's a URL that anyone on the org can look at). I'm trying to imagine distributed grep working as a solution for that org. Maybe it could work, but the large number of varied fleets owned by dozens of teams makes it a bit of a different problem.
Did your org stand up their own ELK stack? AFAIK when I worked there, the only centralized company-wide log solution available was RTLA, and RTLA was designed for detecting and alerting on fatals/errors/exceptions in logs, not as a general purpose log analysis tool like Splunk Enterprise or the ELK stack.
I could see orgs standing up their own solutions like ELK, but our service didn't have to. We just relied on grepping logs stored in Timber for logs older than an hour and grepping logs on prod hosts for real-time searching during outages. Granted, our service did not have many dependent services, but AFAIK the retail website, which has tons of dependencies, also followed a similar model (along with using RTLA for fatals), at least at that time (circa 3 years ago).
Not really. We only had the option to grep logs on prod hosts for the current hour, so the only time that happens is when there is a critical issue going on; logs being limited to 1 hour also means they were limited to a few hundred MBs tops. Even my old 2014-era ThinkPad can handle that workload without breaking a sweat. It didn't even cause a blip in our metrics. Grep is a well-written tool.
All the log grepping for data older than the current hour happened off the prod hosts and thus was never a concern.
Do you ever worry that your logs exporting agent interferes with service performance? After all, that goes on continuously in a setup like the one in the article, rather than as-needed in a distributed predicate evaluation setup.
A constant stable performance drag is going to show up in your load tests, capacity planning, etc. rather than in surprising intermittent degradations.
Also our log collector agents run in containers like everything else, so there is some amount of resource isolation (not perfect of course).
All the resources are at the edges of your infrastructure, so distributed grep makes more and more sense the larger your installation becomes. Centralized logging looks less and less economic as things expand.
In enterprises of this size, it’s quite possible you get a request to debug something that happened a month ago. Your instance may be gone, so you need to debug going only by whatever was logged.
Can't do that in the cloud (I know Twitter doesn't run in the cloud) with autoscaling - the machine goes poof and so do the logs. One of the reasons why autoscaling to save $ is BS.
That’s simplistic. For one thing, at scale not every debug statement is a precious snowflake. Secondly, in an orchestration scheme like k8s there’s no reason why your container lifecycle can’t include a cleanup container that runs and either exports the logs or just sleeps for an hour, so ephemeral data lives a little longer.
Yes, you can try a bunch of hacks like that, but even just the VM failure rate in the cloud is insane compared to any modern hardware (looking at you, GCP, but AWS too, and lower-tier providers are even worse). I see multiple VMs drop dead every week with no way to save those logs unless you use PDs ($$$).
I'm assuming neighboring nodes. Now you're basically designing a replicated (best-effort), distributed index - a mini version of Google search. This is a clever design, but the number of companies that can DIY it with sufficient quality I can probably count on one hand.
For the price of even a modest splunk enterprise setup you can hire half a dozen top notch backend/distributed systems engineers. That’s the opportunity cost of a setup like the one in the article.
Right, if we're talking top notch that's 3-4M/yr alone. Admittedly I'm completely unaware what Splunk's ballpark rate is (I hear it's quite high) but still, I doubt that very many companies are at that scale. Also, in the current market, convincing execs to save on something that doesn't translate directly to COGS can be a challenging proposition =)
I’ve heard this before and it sounds like Splunk’s margins are just way too high. Why doesn’t someone sell a logging platform that’s cheaper than writing your own at scale?
The article said three datacentres further down, so they're capturing a total of around 126TB/day, which is big but definitely a long way short of the biggest Splunk deploys (multi-PB afaik).
Even for Twitter, how can they afford it? It's the best in the game, but charging by ingestion volume is not the best business model, especially when they are not paying for storage. They could charge instead based on users and endpoint entities/app instances that are logging. Now that Cisco is taking over Splunk, I don't expect it to stay great. But if they make it work it could save Cisco!
Worked for a company that used Splunk... I was told to log less.
That was real strange coming from Amazon, where I was told to log anything and everything... And often, without the logs, it would have been impossible to determine root cause of some issues.
The difference was that logging was super cheap at Amazon, because we'd store all the data, but no indexing on the contents of logs.
If your service is small, or if you are only interested in logs for specific hosts, you could download them and grep through them... Or run some sort of mapreduce jobs against them. But that was a long time ago. Surely they have better tooling now.
For security logging, Google cloud security (formerly Chronicle) lets you send unlimited data (but less control of the data or what you can do with it of course).
It has competition left and right from sumologic to elastic cloud, perfect time to sell to Cisco!
I've used Kibana and BigQuery; Splunk is light years ahead. It's not just for logging, it is excellent at big data analytics and visualization. This is what I struggle to communicate to people that develop and deploy these products. I can write a stupid front end to grep too if I just want to query and regex. I want the query language to let me extract and manipulate fields and their values very easily, let me measure all kinds of stats, control the output, pipeline between outputs, and then visualize that data where possible.
You wanna see how many unique users of Chrome 99.x.y.z transfer how much traffic and how frequently they see what page in your web logs? That's like 3-4 quick SPL lines in Splunk. Everyone else expects you to write parsing somewhere, stats elsewhere, and visualization some other place, and even then there are so many limits. Non-Splunk users write pages of Jupyter notebooks to replicate a short |stats Splunk command.
- How do you deal with standardization across the different events you log? How do you handle standardized naming, representations (formatting) and types?
- What about validations? Do you validate the payloads in any way? If so, at what layer? Client-side before logging? In Kafka?
- Any privacy capabilities? How do you make sure the data is not accessed by processes/people that aren't supposed to access a datum?
If anybody is looking for an alternative I recommend checking out Gravwell. They use a similar syntax but an architecture that makes scaling horizontally easier. They also have a free tier that lets you index around 14GB/day:
> Currently we collect around 42 terabytes of data per datacenter each day.
And at the same time the article is very vague on who reads all that data and how. A much more interesting question for a blog post is: how many bytes of the collected log data are used to create actual business value?
It's almost certainly the case that, at that scale and with engineer and tooling churn, the answer is: we don't know, and that's why we're always adding and never subtracting.
It's an embarrassing post, which I read as "we can't even build a commodity subsystem properly". And I think it doesn't look good for Splunk either - it's like a sign of desperation!
I believe that; it actually reads like it was written by someone at Splunk! Very little detail about what they are logging and why - it's a surprisingly poor article that is little more than a crappy ad for Splunk.
With their previous tech stack, they were dropping 90% of logs via a rate limiter. This means that if you were looking to investigate a specific incident, the logs are much more likely to not be there. That feels more "fatally flawed" than "great"
I've worked at two organisations that use Splunk at huge scale. There really is no good alternative when you get really big and want a managed service.
Also, when you have bespoke log solutions and engineers leave, you end up with unsupported solutions. This may be Twitter's 4th logging solution IIRC. Plenty of people know Splunk. Loglens, not so much.
+many. There is plenty of good enterprise software with a bad rep. You can spend a lot of engineering talent rebuilding what is already solved, or just buy boring tech. Splunk is boring by definition, and at Twitter's scale it makes sense. Twitter is not too big, it's decently sized.
I can see people are downvoting, but the whole article reeks of lack of attention to detail. It ends abruptly: "[...] including among core application teams and operations and"
Cloudflare - https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
Uber - https://eng.uber.com/logging/
Facebook - https://research.facebook.com/publications/scuba-diving-into...