In my experience, sampling is simpler to reason about, and with the sampling rate included in the message, deriving metrics from logs works pretty well.
The example in the article is a little contrived. Healthchecks often originate from multiple hosts and/or the logs contain the remote address and port, making each log message effectively unique. So sure, one could parse the remote address into remote_address=192.168.12.23 remote_port=64780 and then decide to drop the port in the aggregation, but is the juice worth the squeeze?
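If one did go that route, the parse-then-drop step is at least cheap. A minimal sketch (the key=value format and the field names here are assumptions for illustration, not from the article):

```python
import re

def aggregation_key(line: str) -> str:
    """Parse key=value pairs out of a log line and drop high-cardinality
    fields (like remote_port) so near-identical lines collapse together."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    fields.pop("remote_port", None)  # port churns per connection; drop it
    return " ".join(f"{k}={v}" for k, v in sorted(fields.items()))

# Two healthcheck lines differing only in ephemeral port map to one key.
k1 = aggregation_key("remote_address=192.168.12.23 remote_port=64780")
k2 = aggregation_key("remote_address=192.168.12.23 remote_port=12345")
```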
If a service emits a log event, then that log event should be visible in your logging system. Basic stuff. Sampling fails this table-stakes requirement.
It's reasonable to drop logs beyond some window of time -- a year, say -- but I'm not sure why you'd ever sample log events. Metric samples, maybe! Log data, no point.
> It's reasonable to drop logs beyond some window of time -- a year, say [...]
That's reasonable in a reasonable environment. Alas, I've worked in large legacy enterprises (banks and the like) where storage space is at much more of a premium, for various reasons.
You're right that naive sampling works better for metrics.
For logs you can still sample, but in a saner way: instead of dropping each log line with an independent probability, you want correlation. E.g., for each log file and each hour, flip only one weighted coin to decide whether to keep the whole chunk.
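A sketch of that correlated scheme, assuming a hash-based coin so that every line in the same (file, hour) chunk gets the same keep/drop verdict, and so any consumer can recompute the decision:

```python
import hashlib

def keep_chunk(log_file: str, hour_bucket: str, rate: float) -> bool:
    """Correlated sampling: one weighted coin per (file, hour) chunk.
    Hashing instead of random() makes the decision deterministic, so
    the same chunk always gets the same verdict."""
    h = hashlib.sha256(f"{log_file}:{hour_bucket}".encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1).
    u = int.from_bytes(h[:8], "big") / 2**64
    return u < rate
```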
Metrics can be sampled, because metrics by definition represent some statistically summarize-able set of observations of related events, whose properties are preserved even when those observations are aggregated over time. That aggregation doesn't result in loss of information. At worst, it results in loss of granularity.
That same key property is not true for logs. Logs cannot be aggregated without loss of information. By definition. This isn't up for debate. You can collect logs into groups, based on similar properties, and you can decide that some of those groups are more or less important, based on some set of heuristics or decisions you can make in your system or platform or whatever. And you can decide that some of those groups can be dropped (sampled) according to some set of rules defined somewhere. But any and all of those decisions result in loss of information. You can say that that lost information isn't important or relevant, according to your own guidelines, and that's fine, but you're still losing information.
If you have finite storage, you need to do something. A very simple rule is to just drop everything older than a threshold. (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)
A slightly more complicated rule is: drop your logs with increasing probability as they age.
With any rule, you will lose information. Yes. Obviously. Sampling metrics also loses information.
The question is what's the best (or a good enough) trade-off for your use case.
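A minimal sketch of that age-weighted rule, assuming an exponential decay with a made-up 30-day half-life (you'd apply it once per retention pass, since repeated passes compound the drop probability):

```python
import random

def keep_probability(age_days: float, half_life_days: float = 30.0) -> float:
    """Retention probability decays exponentially with age: recent logs
    are almost always kept, old logs survive only rarely."""
    return 0.5 ** (age_days / half_life_days)

def should_keep(age_days: float, rng=random.random) -> bool:
    # rng is injectable so the decision can be made deterministic in tests.
    return rng() < keep_probability(age_days)
```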
> If you have finite storage, you need to do something.
Sure.
> A very simple rule is to just drop everything older than a threshold.
Yep, makes sense.
> (Or another simple rule: you keep x GiB of information, and drop the oldest logs until you fit into that size limit.)
Same thing, it seems like.
> A slightly more complicated rules is: drop your logs with increasing probability as they age.
This doesn't really make sense to me. Logs aren't probabilistic. If something happened on 2024-11-01T12:01:01Z, and I have logs for that timestamp, then I should be able to see that thing which happened at that time.
> Sampling metrics also loses information.
I mean, literally not, right? Metrics are explicitly the pillar of observability that can aggregate over time without loss of information. You never need to sample metric data; you just aggregate at whatever layers exist in your system. You can roll up metric data from whatever input granularity D1 to whatever output granularity, e.g. 10·D1, and that "loses" information in some sense, I guess, but the information isn't really lost, it's just made more coarse, or less specific. It's not in any way the same as sampling of e.g. log data, which literally deletes information. Right?
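For counters, at least, the roll-up really is just summation. A toy sketch, assuming (timestamp, count) samples at granularity D1 rolled up by a factor of 10:

```python
from collections import defaultdict

def roll_up(samples, factor=10):
    """Roll per-interval counter samples (t, count) up to coarser buckets.
    For counters, summing preserves the total exactly: granularity drops
    from D1 to factor*D1, but no events are discarded."""
    coarse = defaultdict(int)
    for t, count in samples:
        coarse[t // factor] += count
    return dict(coarse)

rolled = roll_up([(0, 1), (1, 2), (9, 3), (10, 4)])
```

The total event count survives the roll-up unchanged; only the time resolution is coarser.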
We have an enhancement opened with Apple to have a way to delete .cstemp files if the tool runs into them. You'd think we could just add a `find . -name '*.cstemp' -exec rm {} \;` to our build toolchains before building, but we're in a large mono-repo and that would add a lot of time to our builds. Having something like a `--force` to delete the .cstemp files instead of quitting and reporting an error would make us change to this tool pretty quickly I'd think.
Have you tried doing `find . -name '*.cstemp' -exec rm {} +`? This provides as many arguments (filenames) as possible to the executed command (rm) rather than the `\;` which executes the command per file. This provides a massive speed improvement.
There are several JSON parsers that reach GB/s throughput with various interfaces. Most are still poor for large documents and spend far too long in the allocator, but that's not inherent either.
What on Earth are you storing in JSON that this sort of performance issue becomes an issue?
How big is 'large' here?
I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.
I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?
In my case, sentry events that represent crash logs for Adobe Digital Video applications. I’m trying to remember off the top of my head, but I think it was in the gigabytes for a single event.
Not necessarily; for example, Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we're talking about.
Someone downvoted you, I upvoted because I think you have a good point but it would be nice to back it up. I think I agree with you, but I have only used concurrent.futures with threads.
I'll give some more detail. concurrent.futures is designed to be a new, consistent API wrapper around the functionality in the multiprocessing and threading libraries. One example of an improvement is the API for the map function. In multiprocessing, map only accepts a single argument for the function you're calling, so you have to either do partial application or use starmap. In concurrent.futures, the map function will pass through any number of arguments.
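A runnable illustration of that map difference, using threads on both sides so it works anywhere (add is just a throwaway example function):

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def add(a, b):
    return a + b

xs, ys = [1, 2, 3], [10, 20, 30]

# concurrent.futures: map accepts multiple argument iterables directly.
with ThreadPoolExecutor() as executor:
    futures_result = list(executor.map(add, xs, ys))

# multiprocessing-style: map takes one iterable, so you need starmap.
with Pool() as pool:
    pool_result = pool.starmap(add, zip(xs, ys))
```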
The API was designed to be a standard that could be used by other libraries. Before, if you started with threads and then realised you were GIL-limited, switching from the threading module to the multiprocessing module was a complete change. With concurrent.futures, the only thing that needs to change is:
    with ThreadPoolExecutor() as executor:
        executor.map(...)

to

    with ProcessPoolExecutor() as executor:
        executor.map(...)

The API has been adopted by other third-party modules too, so you can do Dask distributed computing with:

    with distributed.Client().get_executor() as executor:
        executor.map(...)

or MPI with:

    with MPIPoolExecutor() as executor:
        executor.map(...)
> Before if you started with thread and then realised you were GIL-limited then switching from the threading module to the multiprocessing module was a complete change
Is this true?
I've been switching back and forth between multiprocessing.Pool and multiprocessing.dummy.Pool for a very long time. Super easy, barely an inconvenience.
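For reference, that swap really is just the import, since the two Pools share an interface (square is a throwaway example function):

```python
from multiprocessing.dummy import Pool  # threads
# from multiprocessing import Pool      # processes: same code below works

def square(x):
    return x * x

with Pool(4) as pool:
    results = pool.map(square, range(5))
```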
I remember, maybe circa 2004, debating Postgres and MySQL with a colleague. I told him to unplug the machine that was hosting his MySQL instance. He did, and it corrupted his database. He said it didn't matter; he had backups, and speed was more important :p This was before MySQL had the InnoDB storage engine; after that it wasn't so bad. I have always stood by Postgres, though. It's a fantastic piece of open source software.
To your point, I replaced an LSTM that required ~$100k of infrastructure with XGBoost that required no more infrastructure (we created and used the model at query time on existing infrastructure we already had for query loads) and only lost about 2% accuracy (LSTM: 98%, XGBoost: 96%). This was two years ago and it's still in use.
I used to introduce people new to machine learning with a Python-converted version of ISL that I was developing. I never finished converting all of ISLR, so this is very welcome!