Maybe I should not be surprised, given that we live in the era of Unity and Electron, but using mmap() to load large files should not be seen as rocket science.
And it is basically available on almost any platform with an MMU and a kernel.
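For the curious, here's a minimal POSIX sketch of the idea (the map_file helper and its signature are mine, not from any particular library):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Minimal sketch: map an entire file read-only and return the pointer. */
    const void *map_file(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  /* the mapping stays valid after the fd is closed */
        if (p == MAP_FAILED) return NULL;

        *len = (size_t)st.st_size;
        return p;
    }

That's the whole trick; the kernel's page cache does the rest.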
Using memory mapped files is not always the right answer.
Memory-mapped files have their disadvantages. The biggest is that any disk read error (or yanking out the USB drive) becomes an access violation exception (also known as a crash), just as if you had read through a bad pointer. You need robust exception handling, which is a taller order than just checking a return value.
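For concreteness, here's roughly what that handling looks like with MSVC's structured exception handling; a sketch, assuming the source pointer comes from a MapViewOfFile-style view (EXCEPTION_IN_PAGE_ERROR is the code Windows raises when paging I/O on a mapped view fails):

    #include <string.h>
    #include <windows.h>

    /* Sketch: copy out of a file-mapping view, turning a failed in-page
     * I/O into an error code instead of an unhandled exception. */
    static int safe_read_view(void *dst, const void *src, size_t n) {
        __try {
            memcpy(dst, src, n);   /* may fault while paging the view in */
            return 0;
        } __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                    ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
            return -1;             /* disk read failed, or the drive is gone */
        }
    }

Note that every code path touching the view needs a guard like this, which is exactly why it's a taller order than checking read()'s return value once.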
Another disadvantage is that even once the file is mapped into your address space, taking a page fault and getting your page costs ~1200 CPU cycles on Windows just for the user<->kernel mode transition, plus the cost of actually performing the I/O. "Just reading the file" involves far fewer user<->kernel transitions: one per read call rather than one per page fault.
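One common mitigation, on Linux at least, is to prefault: ask the kernel to populate pages up front so steady-state accesses don't fault at all. A sketch, assuming p and len describe an existing mapping like the one above:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hint the kernel to fault pages in ahead of time, trading one bulk
     * readahead for fewer per-page user<->kernel transitions later.
     * Linux also offers MAP_POPULATE at mmap() time for a similar effect. */
    void prefault_hint(void *p, size_t len) {
        madvise(p, len, MADV_WILLNEED);    /* schedule asynchronous readahead */
        madvise(p, len, MADV_SEQUENTIAL);  /* widen readahead for linear scans */
    }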
Although it's true that many hardware problems show up as SIGBUS on memory-mapped memory, remember that this is an API and implementation written for high-performance disk drives on important servers; for example, the Ingres server on Berkeley's research VAX (IIRC mmap came into wide use after one of the BSD 4.3 subreleases). That is, at the time, the idea of an easily detachable drive being used for production computing would have been crazy, so I think crashing the app when a drive is removed is not completely insensible.
The page fault will also raise a signal if there is an error reading the sector from the drive (what would be an EIO from read()). Lack of error handling in mmap isn't only a problem for removable media.
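On POSIX systems the usual (if clunky) way to survive that is a SIGBUS handler plus sigsetjmp around accesses to the mapping; a sketch with a hypothetical helper name:

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>

    static sigjmp_buf fault_env;

    static void on_sigbus(int sig) {
        (void)sig;
        siglongjmp(fault_env, 1);
    }

    /* Copy n bytes out of a mapping, converting a SIGBUS (e.g. an EIO
     * while paging in, or the media going away) into an error return.
     * Not thread-safe as written; real code would use thread-local state. */
    static int safe_copy_from_map(void *dst, const void *src, size_t n) {
        struct sigaction sa, old;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        int failed = sigsetjmp(fault_env, 1);
        if (!failed)
            memcpy(dst, src, n);   /* may fault if the backing store errors */

        sigaction(SIGBUS, &old, NULL);
        return failed ? -1 : 0;
    }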
Yes, that sounds like a good idea to me. Like I said: if you use mmap, the expectation is that the drive will not bork, and if it does, terminating the application is acceptable.
I think there just hasn't been a really resource-constrained consumer application for a long time now; only things for enthusiasts have been. LLMs have product-market fit, and running a useful one client-side is resource-constrained, but it turns out that isn't truly a consumer hardware limitation: the models were just never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.
On the other hand, it's only been a few weeks, so maybe I should ignore this absurdity and just wait.
Probably a combination of (a) ML framework people not paying much attention to CPU inference, since they already have GPUs/TPUs lying around for training - CPU inference is just for very quick experiments; (b) research code never having been the best optimized for performance; and (c) ML people not generally being systems programmers, while a lot of systems programmers are afraid to mess with ML code outside of low-level computation kernels (it doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train-time sharding of the weights. And really, nobody is doing CPU inference with all the GPUs we have. The "CLI" use case also seems contrived to me: if you plan to interact with the model several times and want to keep the weights in RAM, why not start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art
mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.
AFAIK CUDA offers unified memory, which basically works via the virtual address space, page-faulting data in from main memory. There is also the IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs, drives, and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally network messages could be plumbed straight through to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmapped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp's mmap support talked about a 'fairly new technique'. mmap has been in the UNIX programmer's tool belt for literally decades.