For those who are smaller, don't have the money to pay for Splunk Enterprise, and don't have the headcount to build their own logging infrastructure: I built a service called GraphJSON that makes it super easy to log and process any type of data. You can read more about how and why I built it here: https://www.graphjson.com/guides/about
I was hoping TFA would break down their log-based observability strategy and go into things like trace ids, structured logs....
Instead I am disappointed to learn that engineering at Twitter still sounds.... suboptimal. This from the company that brought us the fail whale (I can't find the blog post now, but I recall Twitter's instability was due to a massive kludge that went unnoticed for years).
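(For reference, by "trace ids, structured logs" above I mean something like the minimal sketch below -- the field names, service name, and logging setup are all made up for illustration, not anything from the article.)

```python
import json
import logging
import time
import uuid

# Hypothetical sketch: emit one JSON object per log line so downstream tools can
# index fields instead of regexing free text. Field names here are invented.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "service": "timeline-api",
            "trace_id": getattr(record, "trace_id", None),  # propagated per request
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The trace id is generated at the edge and passed through every service call,
# so all the logs for one request can be joined later.
log.info("loaded home timeline", extra={"trace_id": str(uuid.uuid4())})
```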
Another nit: the "abandoned" part is the one that was open sourced a loooong time ago. Since then it has undergone a ton of changes, but yeah, those all live within Facebook. So, not abandoned at all, just not available outside.
The main issue (as usual in big companies) is the large amount of inter-dependencies with internal systems. Scribe as-is today doesn't make much sense outside Facebook. And yeah, it could be cleaned up, the internal services mocked out, and then OSS'd. But that's a lot of work, both doing it and maintaining it. And with all those mocks, etc., it "wouldn't be Scribe as-is" either, in terms of how it scales and so on.
In any case, the storage layer (LogDevice, https://github.com/facebookarchive/LogDevice), which is a large part of the system, was open sourced a while back and... sadly it too ended up as unmaintained OSS.
Finally, you can get a similar-ish system working with OSS tools (e.g. fluentd + Kafka) that will also scale quite well and which IMHO can also be made to scale to Facebook-size levels. So the incentive is there, but there are OSS alternatives already available :)
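To make the fluentd + Kafka idea a bit more concrete, here's a minimal sketch of the shipper half of such a pipeline, written against the kafka-python client rather than fluentd itself; the broker addresses, topic name, and log path are all made up:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical sketch: tail a JSON-per-line log file and ship each record to a
# Kafka topic -- roughly the job fluentd's Kafka output plugin does for you.
producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks=1,  # trade durability for throughput; use acks="all" if losing logs hurts
)

with open("/var/log/app/service.json.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than blocking the pipeline
        # Key by host so one host's logs stay ordered within a partition.
        producer.send("app-logs", key=event.get("host", "").encode(), value=event)

producer.flush()
```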
Well, the open source part of it is totally abandoned.
The internal improvements aren't interesting or available to anyone outside of facebook.
The open source version is creaky and very difficult to drag forward to newer versions of thrift.
So -- thanks Facebook for dropping weird artifacts like the aftermath of Roadside Picnic. Woe unto all who adopt these soon-to-be-orphaned technologies.
Just started my 2 week trial of GraphJSON a few days ago! I like it so far, kudos. Unrelated question: do you have some kind of notification setup for HN? I feel like I’ve seen you comment on many logging-related posts, so figured I’d ask =)
The stickiest part of a logging pipeline is the ingestion and message format (scribe in this case); those are the gravity and speed of light of your organization, embedded at the lowest layer of everything everywhere.
Three years from now they could / will write a new blog post "today we're open sourcing smackhouse, a log analytics platform based on smooshing logs into clickhouse! We ETL the data using this zany rust thing (zaptl) from kafka into clickhouse, and made some zippy UI on top. We still stream logs into hadoop and goolens and splunk; any teams that want to use these other infrastructures just set a config in a fargml setting to zaptl to request the log stream and sample rate."
They're not using all of splunk; they're only using it to solve one of the harder parts of logging -- log analytics. They're not using it to solve the stickiest part of logging.
Good for them. BTW -- this reads like a cross between a proof of life video and a newsletter written by the intern. Glad it's earning them a huge discount from splunk.
If you're not already using smackhouse to smoosh your logs into clickhouse, how can you even call yourself a developer, bro? Who do you code for? Standard Oil? Ma Bell?
It’s fascinating to see companies solve the same problems over and over again.
I hope in the near future these types of problems are “definitively solved” and more effort can be spent engineering novel features.
There’s no way to do this, but I’d be interested to see if there’s a company that just has an extremely simple stack that is very performant and popular.
I’m talking: their entire stack is Citus sharded over a few locations running two beefy servers per continent they have users in, that’s it. No dedicated search, logging or anything. Just dump everything in Postgres. If logs are lost then so be it.
A lot of the problems at scale are problems of our own creation. For example, is strong consistency really necessary for tweets? Do you really need to see a follower's latest tweet instantly?
Uncheck some boxes and the engineering becomes a lot simpler, very fast.
> For example, is strong consistency really necessary for tweets? Do you really need to see a follower's latest tweet instantly?
The problem is that "strong consistency" basically means your system behaves in predictable ways that you can easily reason about. If you give it up, you can see all kinds of failure modes that go beyond just seeing stale data.
Suppose a user first sets their profile to "private", meaning only their trusted friends can see their posts. And then, having seen that that operation succeeded, they make a post that contains personal, sensitive information. You had better be damn sure that no user-facing components of your system can ever observe those two events in the wrong order.
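To make that concrete, here's a hedged sketch (every name is invented, and this is not a claim about how Twitter does it) of one way to get that ordering without global strong consistency: the post records the profile version it was written after, and a replica refuses to serve it until its local copy of the profile has caught up.

```python
# Hypothetical sketch of causal ordering between a profile change and a post.
# "store" and "replica" are abstract stand-ins for whatever storage you use.

class StaleReplica(Exception):
    pass

def set_profile_private(store, user_id):
    # Returns a monotonically increasing version number for the profile record.
    return store.write(("profile", user_id), {"visibility": "private"})

def create_post(store, user_id, body, profile_version):
    # The post carries the profile version it causally depends on.
    store.write(("post", user_id), {
        "body": body,
        "requires_profile_version": profile_version,
    })

def render_post(replica, user_id, post, viewer):
    profile_version, profile = replica.read(("profile", user_id))
    if profile_version < post["requires_profile_version"]:
        # This replica hasn't seen the privacy change yet: don't serve the post
        # under stale visibility rules; retry elsewhere or wait for replication.
        raise StaleReplica(user_id)
    if profile["visibility"] == "private" and viewer not in profile.get("trusted", []):
        return None  # hidden from this viewer
    return post["body"]
```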
If you threw strong consistency out the window but kept low latency, you'd still get the "make profile private" event well before the user has navigated to the post page and composed a tweet.
The problem is ensuring that event has been propagated all around the world so you aren’t displaying now-private tweets. That’s why we need strong consistency.
I have spent much of my time at this company on this exact problem.
Still linearizable under the insane assumption that the db is always “low latency” which implies it never goes down. The write either happened or it didn’t.
One problem is that at a certain point you can’t just lose logs - for example, somebody might pay you per ad impression or click. Losing logs can mean losing evidence that those impressions/clicks happened. Logs aren’t just a debugging tool, they can be a critical legal CYA.
I’m not saying that people don’t overcomplicate projects (they do) but a lot of simplifying assumptions can break down at scale.
You can even handle impressions/clicks/conversions in a lossy system, you just have to bound how much loss you're willing to accept. The counterbalancing factor is how expensive you want your logging infrastructure to be -- the more consistency, the more it costs. The other trade off is availability: what do you do when the logging backend is temporarily down? Return a 503 to the user? Or accept that you're not making money on that impression and move on?
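As a toy illustration of "bound how much loss you're willing to accept" (everything below is made up): a fixed-size client-side buffer that keeps serving traffic when the logging backend is slow or down, and counts what it drops so the loss is at least measurable.

```python
import queue
import threading

# Hypothetical sketch: a non-blocking, bounded log buffer. When the logging
# backend is slow or down, we drop events and count them instead of failing
# the user's request with a 503.
class BoundedLogger:
    def __init__(self, sink, max_buffered=10_000):
        self._q = queue.Queue(maxsize=max_buffered)
        self._dropped = 0
        self._sink = sink  # e.g. a Kafka producer or HTTP client, injected
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, event):
        try:
            self._q.put_nowait(event)
        except queue.Full:
            self._dropped += 1  # bounded, measurable loss instead of back-pressure

    def _drain(self):
        while True:
            event = self._q.get()
            try:
                self._sink.send(event)
            except Exception:
                self._dropped += 1  # backend down: accept the loss and move on

    @property
    def dropped(self):
        return self._dropped
```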
True, logs do need to be stored somewhere. What I meant was: if the info is that important to the business, it should be stored in a system that provides stronger durability than just logging it.
Seems we are referring to two different things. What you mention refers to DB journal logs, and my comment was about application logs (debug, info, error messages, etc.)
Though I agree, I think the parent's point is that you don't necessarily know whether it's important, so you have to have some general record keeping with respect to "activity".
As you enter larger corporations, you see more Resume Driven Development. Overengineering isn't something to be avoided. It is the purpose of your 3 to 5 year stint there. Management rewards it so everyone participates in the circus together until one day the board fires 80% of engineering and hires contractors.
I'd suggest you look at Stack Overflow engineering posts. Their architecture is about as simple as you can get for a service their size. A bunch of web servers running plain old ASP.NET MVC code (recently upgraded to use ASP.NET Core). A handful of SQL server machines for the database, simple sharding strategies. Some Redis machines for caching.
Twitter's real-time aspect is a core value proposition compared to other social media platforms. Nothing tells you what is happening right this second as well as Twitter. It is faster than earthquake shock waves. And people rely on that speed for hundreds of millions of dollars in revenue.
This feels heavily overstated. Especially when you consider that the vast majority of "influencers" and such actually schedule their tweets.
I agree that current tweets are important, of course. But I also acknowledge that just as often I will be scrolling through the "In case you missed it" section of tweets. And I have no idea how rapidly those actually made it to my timeline.
Is it overstated? Given that my timeline constantly shifts under my scrolling, often showing me out-of-date crap, yes. Worse, many posts are ads or trending items where... no, you would not know if there were seconds of lag in them getting processed.
You could almost argue that the small like counts have to be strongly consistent. But I often get a notification of a like well before it shows on my timeline. So it is clearly not strongly consistent.
You have to take into account that the people that Twitter makes the most money off of for real time functionality are not integrated through the web interface.
The previous POTUS made important policy announcements on Twitter first. It would be very difficult to understand what was going on without fast, consistent tweets.
With slow, unreliable tweets he might have had to be filtered through normal channels, which would have meant they came out at more predictable times like “press conferences” after being discussed by the people involved in that whole process, instead of “I watched Fox and Friends and now I’m pissed off about whatever they wanted me to be angry about today”. It would have been much easier to understand things, that shit was constant chaos.
Trump announced a ban on transgender soldiers via tweet like 1 day after giving a heads up to the Defense Secretary who was on vacation at the time. Nobody else in the military was prepared for the policy change and they were left scrambling.
Multiple times he fired/hired senior members of his administration via twitter, including defense secretary, secretary of state, national security advisor, chief of staff, ...
> Uncheck some boxes and the engineering becomes a lot simpler
I think you're oversimplifying to the point of nonsense. In my domain, reliable logging is an absolute must due to regulations and the fact that we deal with people's money. Pretty sure we're not the only industry that can't afford to approach logging the way you painted it.
The blog post just cuts off at the end?
"Ultimately, this migration has resulted in increased adoption of centralized logging, including among core application teams and operations and"
Ok so apart from the article being kinda surprisingly bland -- I think the coolest part was how easily they were able to adopt a new architecture thanks to the decoupling they had between producers and consumers via Kafka. Pub/sub, fan-out, and messaging patterns are such great enablers. Seriously though, what's up with the lack of quality control on the post...
I am pretty sure Twitter gets a huge discount. A company like Twitter being on the Splunk platform is itself a huge advertisement to future, more traditional Fortune 500 customers.
We log >>100TB/day in splunk and get all the discounts. It’s still ridiculously expensive.
Many of the issues presented in the article ring very true. Splunk is pretty amazing for ad hoc analysis/threat hunting. However, once you know what you're looking for, the value proposition drops precipitously.
The whole architecture looks insanely expensive. Putting your debug logs through scribe, through Kafka, then into Splunk, and indexing them, has got to cost something like fifty dollars per gigabyte.
I've always been a proponent of leaving logs where they were produced, not collecting them, and not indexing them, so I find an architecture like this just shocking.
Centralized logging is one of the many problems you sign up for once you opt for microservices on a cluster scheduler. Since service instances are ephemeral and cluster nodes only slightly less so, it doesn’t really work to leave the logs in place.
When you have fleets of thousands of machines performing a given service, and thousands of services implementing your product, you simply must centralize and index logs if you want to have any chance of managing outages.
I worked on a tier 1 service at Amazon with over 1500 hosts in the fleet. Logs from the current hour would be on the host and we would literally just grep through them; to grep through logs on a subset of the fleet or the entire fleet, there was a simple utility that would ssh into multiple prod hosts and run grep in parallel. For anything older, the logs went gzipped into storage and we zgrep'd through them. No centralized logging (unless you count logs being rotated off the host at the end of the hour) and definitely no indexing.
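For flavor, that kind of utility is roughly in the spirit of this sketch (host list, log path, and concurrency are invented; the real thing was surely more robust):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of a fleet-wide grep: ssh into each prod host, grep the
# current hour's log, and prefix each match with the host it came from.
def grep_host(host, pattern, path="/var/log/service/service.log"):
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "grep", "--", pattern, path],
        capture_output=True, text=True, timeout=60,
    )
    return [f"{host}: {line}" for line in result.stdout.splitlines()]

def fleet_grep(hosts, pattern, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for matches in pool.map(lambda h: grep_host(h, pattern), hosts):
            for line in matches:
                print(line)

if __name__ == "__main__":
    fleet_grep(open("hosts.txt").read().split(), "RequestId=abc123")
```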
Interesting. I worked at Amazon on a large CDO product with hundreds or maybe low thousands of services. Some services probably had a couple instances, others had hundreds. There was a customized ELK stack that was indispensable, in my opinion, to tracking problems and communicating them (e.g. here's a URL that anyone on the org can look at). I'm trying to imagine distributed grep working as a solution for that org. Maybe it could work, but the large number of varied fleets owned by dozens of teams makes it a bit of a different problem.
Did your org stand up their own ELK stack? AFAIK when I worked there, the only centralized company-wide log solution available was RTLA, and RTLA was designed for detecting and alerting on fatals/errors/exceptions in logs, not as a general purpose log analysis tool like Splunk Enterprise or the ELK stack.
I could see orgs standing up their own solutions like ELK, but our service didn't have to. We just relied on grepping logs stored in Timber for logs older than an hour and grepping logs on prod hosts for real-time searching during outages. Granted, our service did not have many dependent services, but AFAIK the retail website, which has tons of dependencies, also followed a similar model (along with using RTLA for fatals), at least at that time (circa 3 years ago).
Not really. We only had the option to grep logs on prod hosts for the current hour, so the only time that happens is when there is a critical issue going on; logs being limited to 1 hour also means they were limited to a few hundred MBs tops. Even my old 2014-era ThinkPad can handle that workload without breaking a sweat. It didn't even cause a blip in our metrics. Grep is a well-written tool.
All the log grepping for data older than the current hour happened off the prod hosts and thus was never a concern.
Do you ever worry that your logs exporting agent interferes with service performance? After all, that goes on continuously in a setup like the one in the article, rather than as-needed in a distributed predicate evaluation setup.
A constant stable performance drag is going to show up in your load tests, capacity planning, etc. rather than in surprising intermittent degradations.
Also our log collector agents run in containers like everything else, so there is some amount of resource isolation (not perfect of course).
All the resources are at the edges of your infrastructure, so distributed grep makes more and more sense the larger your installation becomes. Centralized logging looks less and less economic as things expand.
In enterprises of this size, it’s quite possible you get a request to debug something that happened a month ago. Your instance may be gone, so you need to debug going only by whatever was logged.
Can't do that in the cloud (I know Twitter doesn't run in the cloud) with autoscaling - the machine goes poof and so do the logs. One of the reasons why autoscaling to save $ is BS.
That’s simplistic. For one thing, at scale not every debug statement is a precious snowflake. Secondly, in an orchestration scheme like k8s there’s no reason why your container lifecycle can’t include a cleanup container that runs and either exports the logs or just sleeps for an hour, so ephemeral data lives a little longer.
Yes, you can try a bunch of hacks like that, but even just the VM failure rate in the cloud is insane compared to any modern hardware (looking at you, GCP, but AWS too, and lower-tier providers are even worse). I see multiple VMs drop dead every week with no way to save those logs unless you use PDs ($$$).
I'm assuming neighboring nodes. Now you're basically designing a replicated (best-effort), distributed index - a mini version of Google search. This is a clever design, but the number of companies that can DIY it with sufficient quality I can probably count on one hand.
For the price of even a modest splunk enterprise setup you can hire half a dozen top notch backend/distributed systems engineers. That’s the opportunity cost of a setup like the one in the article.
Right, if we're talking top notch that's 3-4M/yr alone. Admittedly I'm completely unaware what Splunk's ballpark rate is (I hear it's quite high) but still, I doubt that very many companies are at that scale. Also, in the current market, convincing execs to save on something that doesn't translate directly to COGS can be a challenging proposition =)
I’ve heard this before and it sounds like Splunk’s margins are just way too high. Why doesn’t someone sell a logging platform that’s cheaper than writing your own at scale?
The article said three datacentres further down, so they're capturing a total of around 126TB/day, which is big but definitely a long way short of the biggest Splunk deploys (multi-PB afaik).
Even for Twitter, how can they afford it? It's the best in the game, but charging by ingestion volume is not the best business model, especially when they are not paying for storage. They could charge instead based on users and endpoint entities/app instances that are logging. Now that Cisco is taking over Splunk, I don't expect it to stay great. But if they make it work it could save Cisco!
Worked for a company that used Splunk... I was told to log less.
That was real strange coming from Amazon, where I was told to log anything and everything... And often, without the logs, it would have been impossible to determine root cause of some issues.
The difference was that logging was super cheap at Amazon, because we'd store all the data, but no indexing on the contents of logs.
If your service is small, or if you are only interested in logs for specific hosts, you could download them and grep through them... Or run some sort of mapreduce jobs against them. But that was a long time ago. Surely they have better tooling now.
For security logging, Google cloud security (formerly Chronicle) lets you send unlimited data (but less control of the data or what you can do with it of course).
It has competition left and right from sumologic to elastic cloud, perfect time to sell to Cisco!
I've used Kibana and BigQuery; Splunk is light years ahead. It's not just for logging, it is excellent at big data analytics and visualization. This is what I struggle to communicate to people that develop and deploy these products. I can write a stupid front end to grep too if I just want to query and regex. I want the query language to let me extract and manipulate fields and their values very easily, let me measure all kinds of stats, control the output, pipeline between outputs, and then visualize that data where possible.
You wanna see how many unique users of Chrome 99.x.y.z transfer how much traffic and how frequently they see what page in your web logs? That's like 3-4 quick SPL lines in Splunk. Everyone else expects you to write parsing somewhere, stats elsewhere, and visualization some other place, and even then there are so many limits. Non-Splunk users write pages of Jupyter notebooks to replicate a short |stats Splunk command.
- How do you deal with standardization across the different events you log? How do you handle standardized naming, representations (formatting) and types?
- What about validations? Do you validate the payloads in any way? If so, at what layer? Client-side before logging? In Kafka?
- Any privacy capabilities? How do you make sure the data is not accessed by processes/people that aren't supposed to access a datum?
If anybody is looking for an alternative I recommend checking out Gravwell. They use a similar syntax but an architecture that makes scaling horizontally easier. They also have a free tier that lets you index around 14GB/day:
> Currently we collect around 42 terabytes of data per datacenter each day.
And at the same time the article is very vague on who reads all that data and how. A much more interesting question for a blog post is: how many bytes of the collected log data are used to create actual business value?
It's almost certainly the case that, at that scale and with engineer and tooling churn, the answer is: we don't know, and that's why we're always adding and never subtracting.
It's an embarrassing post, which I read as "we can't even build a commodity subsystem properly". And I think it doesn't look good for Splunk either - it's like a sign of desperation!
I believe that; it actually reads like it was written by someone at Splunk! Very little detail about what they are logging and why - it's a surprisingly poor article that is little more than a crappy ad for Splunk.
With their previous tech stack, they were dropping 90% of logs via a rate limiter. This means that if you were looking to investigate a specific incident, the logs are much more likely to not be there. That feels more "fatally flawed" than "great"
I've worked at two organisations that use Splunk at huge scale. There really is no good alternative when you get really big and want a managed service.
Also, when you have bespoke log solutions and engineers leave, you end up with unsupported solutions. This may be Twitter's 4th logging solution IIRC. Plenty of people know Splunk. Loglens, not so much.
+many. There is plenty of good enterprise software with a bad rep. You can spend a lot of engineering talent rebuilding what is already solved, or just buy boring tech. Splunk is boring by definition, and at Twitter's scale it makes sense. Twitter is not too big, it's decently sized.
I can see people are downvoting, but the whole article reeks of lack of attention to detail. It ends abruptly: "[...] including among core application teams and operations and"
Cloudflare - https://blog.cloudflare.com/http-analytics-for-6m-requests-p...
Uber - https://eng.uber.com/logging/
Facebook - https://research.facebook.com/publications/scuba-diving-into...