In my experience, sampling is simpler to reason about, and with the sampling rate included in the message, deriving metrics from logs works pretty well.
The example in the article is a little contrived. Healthchecks often originate from multiple hosts and/or the logs contain the remote address and port, making each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then decide to drop the port in the aggregation, but is the juice worth the squeeze?
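If one did go that route, the parse-then-drop step is at least cheap. A minimal sketch (the key=value format and the field names here are assumptions for illustration, not from the article):

```python
import re

def aggregation_key(line: str) -> str:
    """Parse key=value pairs out of a log line and drop high-cardinality
    fields (like remote_port) so near-identical lines collapse together."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    fields.pop("remote_port", None)  # port churns per connection; drop it
    return " ".join(f"{k}={v}" for k, v in sorted(fields.items()))

# Two healthcheck lines differing only in ephemeral port map to one key.
k1 = aggregation_key("remote_address=192.168.12.23 remote_port=64780")
k2 = aggregation_key("remote_address=192.168.12.23 remote_port=12345")
```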
If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.
It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.
> It's reasonable to drop logs beyond some window of time -- a year, say [...]
That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks and the like) where storage space is at much more of a premium, for various reasons.
You're right that naive sampling works better for metrics.
For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you want correlation. E.g., for each log file and each hour, flip only one weighted coin to decide whether to keep the whole chunk.
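A sketch of that correlated scheme, assuming a hash-based coin so that every line in the same (file, hour) chunk gets the same keep/drop verdict, and so any consumer can recompute the decision:

```python
import hashlib

def keep_chunk(log_file: str, hour_bucket: str, rate: float) -> bool:
    """Correlated sampling: one weighted coin per (file, hour) chunk.
    Hashing instead of random() makes the decision deterministic, so
    the same chunk always gets the same verdict."""
    h = hashlib.sha256(f"{log_file}:{hour_bucket}".encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1).
    u = int.from_bytes(h[:8], "big") / 2**64
    return u < rate
```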
Metrics can be sampled, because metrics by definition represent some statistically summarize-able set of observations of related events, whose properties are preserved even when those observations are aggregated over time. That aggregation doesn't result in loss of information. At worst, it results in loss of granularity.
That same key property is not true for logs. Logs cannot be aggregated without loss of information. By definition. This isn't up for debate. You can collect logs into groups, based on similar properties, and you can decide that some of those groups are more or less important, based on some set of heuristics or decisions you can make in your system or platform or whatever. And you can decide that some of those groups can be dropped (sampled) according to some set of rules defined somewhere. But any and all of those decisions result in loss of information. You can say that that lost information isn't important or relevant, according to your own guidelines, and that's fine, but you're still losing information.
If you have finite storage, you need to do something. A very simple rule is to just drop everything older than a threshold. (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)
A slightly more complicated rule is: drop your logs with increasing probability as they age.
With any rule, you will lose information. Yes. Obviously. Sampling metrics also loses information.
The question is what's the best (or a good enough) trade-off for your use case.
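A minimal sketch of that age-weighted rule, assuming an exponential decay with a made-up 30-day half-life (you'd apply it once per retention pass, since repeated passes compound the drop probability):

```python
import random

def keep_probability(age_days: float, half_life_days: float = 30.0) -> float:
    """Retention probability decays exponentially with age: recent logs
    are almost always kept, old logs survive only rarely."""
    return 0.5 ** (age_days / half_life_days)

def should_keep(age_days: float, rng=random.random) -> bool:
    # rng is injectable so the decision can be made deterministic in tests.
    return rng() < keep_probability(age_days)
```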
> If you have finite storage, you need to do something.
Sure.
> A very simple rule is to just drop everything older than a threshold.
Yep, makes sense.
> (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)
Same thing, it seems like.
> A slightly more complicated rules is: drop your logs with increasing probability as they age.
This doesn't really make sense to me. Logs aren't probabilistic. If something happened on 2024-11-01T12:01:01Z, and I have logs for that timestamp, then I should be able to see that thing which happened at that time.
> Sampling metrics also loses information.
I mean, literally not, right? Metrics are explicitly the pillar of observability that can aggregate over time without loss of information. You never need to sample metric data; you just aggregate at whatever layers exist in your system. You can roll up metric data from whatever input granularity D1 to whatever output granularity, e.g. 10·D1, and that "loses" information in some sense, I guess, but the information isn't really lost, it's just made more coarse, or less specific. It's not in any way the same as sampling of e.g. log data, which literally deletes information. Right?
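For counters, at least, the roll-up really is just summation. A toy sketch, assuming (timestamp, count) samples at granularity D1 rolled up by a factor of 10:

```python
from collections import defaultdict

def roll_up(samples, factor=10):
    """Roll per-interval counter samples (t, count) up to coarser buckets.
    For counters, summing preserves the total exactly: granularity drops
    from D1 to factor*D1, but no events are discarded."""
    coarse = defaultdict(int)
    for t, count in samples:
        coarse[t // factor] += count
    return dict(coarse)

rolled = roll_up([(0, 1), (1, 2), (9, 3), (10, 4)])
```

The total event count survives the roll-up unchanged; only the time resolution is coarser.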
We have an enhancement opened with Apple to have a way to delete .cstemp files if the tool runs into them. You'd think we could just add a `find . -name '*.cstemp' -exec rm {} \;` to our build toolchains before building, but we're in a large mono-repo and that would add a lot of time to our builds. Having something like a `--force` to delete the .cstemp files instead of quitting and reporting an error would make us change to this tool pretty quickly I'd think.
Have you tried doing `find . -name '*.cstemp' -exec rm {} +`? This provides as many arguments (filenames) as possible to the executed command (rm) rather than the `\;` which executes the command per file. This provides a massive speed improvement.
There are several JSON parsers that reach GB/s throughput with various interfaces. Most are still poor for large documents and spend far too long in the allocator, but that's not inherent either.
What on Earth are you storing in JSON that this sort of performance issue becomes an issue?
How big is 'large' here?
I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.
I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?
In my case, sentry events that represent crash logs for Adobe Digital Video applications. I’m trying to remember off the top of my head, but I think it was in the gigabytes for a single event.
Not necessarily; for example, Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we're talking about.
Someone downvoted you, I upvoted because I think you have a good point but it would be nice to back it up. I think I agree with you, but I have only used concurrent.futures with threads.
I'll give some more detail. concurrent.futures is designed to be a new, consistent API wrapper around the functionality in the multiprocessing and threading libraries. One example of an improvement is the API for the map function. In multiprocessing, map only accepts a single argument for the function you're calling, so you have to either do partial application or use starmap. In concurrent.futures, the map function will pass through any number of arguments.
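A runnable illustration of that map difference, using threads on both sides so it works anywhere (add is just a throwaway example function):

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def add(a, b):
    return a + b

xs, ys = [1, 2, 3], [10, 20, 30]

# concurrent.futures: map accepts multiple argument iterables directly.
with ThreadPoolExecutor() as executor:
    futures_result = list(executor.map(add, xs, ys))

# multiprocessing-style: map takes one iterable, so you need starmap.
with Pool() as pool:
    pool_result = pool.starmap(add, zip(xs, ys))
```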
The API was designed to be a standard that could be used by other libraries. Before, if you started with threads and then realised you were GIL-limited, switching from the threading module to the multiprocessing module was a complete change. With concurrent.futures, the only thing that needs to change is:
    with ThreadPoolExecutor() as executor:
        executor.map(...)

to

    with ProcessPoolExecutor() as executor:
        executor.map(...)

The API has been adopted by other third-party modules too, so you can do Dask distributed computing with:

    with distributed.Client().get_executor() as executor:
        executor.map(...)

or MPI with:

    with MPIPoolExecutor() as executor:
        executor.map(...)
> Before if you started with thread and then realised you were GIL-limited then switching from the threading module to the multiprocessing module was a complete change
Is this true?
I've been switching back and forth between multiprocessing.Pool and multiprocessing.dummy.Pool for a very long time. Super easy, barely an inconvenience.
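For reference, that swap really is just the import, since the two Pools share an interface (square is a throwaway example function):

```python
from multiprocessing.dummy import Pool  # threads
# from multiprocessing import Pool      # processes: same code below works

def square(x):
    return x * x

with Pool(4) as pool:
    results = pool.map(square, range(5))
```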
I remember, maybe circa 2004, debating Postgres and MySQL with a colleague. I told him to unplug the machine that was hosting his MySQL instance. He did, and it corrupted his database. He said it didn't matter; he had backups, and speed was more important :p This was before MySQL had the InnoDB storage engine; after that it wasn't so bad. I have always stood by Postgres, though. It's a fantastic piece of open source software.
To your point, I replaced an LSTM that required ~$100k of infrastructure with XGBoost that required no more infrastructure (we created and used the model at query time on existing infrastructure we already had for query loads) and only lost about 2% accuracy (LSTM: 98%, XGBoost: 96%). This was two years ago and it's still in use.
I used to introduce people new to machine learning with a Python-converted version of ISL that I was developing. I never finished converting all of ISLR, so this is very welcome!