Maybe I should not be surprised, given that we live in the era of Unity and Electron, but using mmap() to load large files should not be seen as rocket science.
And it is basically available on almost any platform with an MMU and a kernel.
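For the curious, here's a minimal POSIX sketch of the idea (the map_file helper and its signature are mine, not from any particular library):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Minimal sketch: map an entire file read-only and return the pointer. */
    const void *map_file(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  /* the mapping stays valid after the fd is closed */
        if (p == MAP_FAILED) return NULL;

        *len = (size_t)st.st_size;
        return p;
    }

That's the whole trick; the kernel's page cache does the rest.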
Using memory mapped files is not always the right answer.
Memory-mapped files have their disadvantages. The biggest is that any disk read error (or yanking out the USB drive) becomes an access violation exception (also known as a crash), just as if you had read through a bad pointer. You need robust exception handling, which is a taller order than just checking a return value.
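For concreteness, here's roughly what that handling looks like with MSVC's structured exception handling; a sketch, assuming the source pointer comes from a MapViewOfFile-style view (EXCEPTION_IN_PAGE_ERROR is the code Windows raises when paging I/O on a mapped view fails):

    #include <string.h>
    #include <windows.h>

    /* Sketch: copy out of a file-mapping view, turning a failed in-page
     * I/O into an error code instead of an unhandled exception. */
    static int safe_read_view(void *dst, const void *src, size_t n) {
        __try {
            memcpy(dst, src, n);   /* may fault while paging the view in */
            return 0;
        } __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                    ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
            return -1;             /* disk read failed, or the drive is gone */
        }
    }

Note that every code path touching the view needs a guard like this, which is exactly why it's a taller order than checking read()'s return value once.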
Another disadvantage is that even once the file is mapped into your address space, taking a page fault and getting your page costs ~1200 CPU cycles on Windows just for the user<->kernel mode transition, plus the cost of actually performing the I/O. "Just reading the file" involves far fewer user<->kernel transitions: one per read call rather than one per page fault.
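One common mitigation, on Linux at least, is to prefault: ask the kernel to populate pages up front so steady-state accesses don't fault at all. A sketch, assuming p and len describe an existing mapping like the one above:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hint the kernel to fault pages in ahead of time, trading one bulk
     * readahead for fewer per-page user<->kernel transitions later.
     * Linux also offers MAP_POPULATE at mmap() time for a similar effect. */
    void prefault_hint(void *p, size_t len) {
        madvise(p, len, MADV_WILLNEED);    /* schedule asynchronous readahead */
        madvise(p, len, MADV_SEQUENTIAL);  /* widen readahead for linear scans */
    }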
Although it's true that many hardware problems show up as SIGBUS on memory-mapped memory, remember that this is an API and implementation written for high-performance disk drives on important servers; for example, the Ingres server on Berkeley's research VAX (IIRC mmap came into wide use after one of the BSD 4.3 subreleases). That is, at the time, the idea of an easily detachable drive being used for production computing would have been crazy, so I think crashing the app when a drive is removed is not completely insensible.
The page fault will also raise a signal if there is an error reading the sector from the drive (what would be an EIO from read()). Lack of error handling in mmap isn't only a problem for removable media.
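On POSIX systems the usual (if clunky) way to survive that is a SIGBUS handler plus sigsetjmp around accesses to the mapping; a sketch with a hypothetical helper name:

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>

    static sigjmp_buf fault_env;

    static void on_sigbus(int sig) {
        (void)sig;
        siglongjmp(fault_env, 1);
    }

    /* Copy n bytes out of a mapping, converting a SIGBUS (e.g. an EIO
     * while paging in, or the media going away) into an error return.
     * Not thread-safe as written; real code would use thread-local state. */
    static int safe_copy_from_map(void *dst, const void *src, size_t n) {
        struct sigaction sa, old;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        int failed = sigsetjmp(fault_env, 1);
        if (!failed)
            memcpy(dst, src, n);   /* may fault if the backing store errors */

        sigaction(SIGBUS, &old, NULL);
        return failed ? -1 : 0;
    }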
Yes, that sounds like a good idea to me. Like I said: if you use mmap, the expectation is that the drive will not bork, and if it does, terminating the application is acceptable.
I think there just hasn't been a really resource-constrained consumer application for a long time now; only things for enthusiasts have been. LLMs have product-market fit, and running a useful one client-side is resource-constrained, but it turns out that isn't truly a consumer hardware limitation: the models were just never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.
On the other hand, it's only been a few weeks, so maybe I should ignore this absurdity and just wait.
Probably a combination of (a) ML framework people not paying much attention to CPU inference, since they already have GPUs/TPUs lying around for training - CPU inference is just for very quick experiments; (b) research code never having been the best optimized for performance; and (c) ML people not generally being systems programmers, while a lot of systems programmers are afraid to mess with ML code outside of low-level computation kernels (it doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train-time sharding of the weights. And really, nobody is doing CPU inference with all the GPUs we have. The "CLI" use case also seems contrived to me: if you plan to interact with the model several times and want to keep the weights in RAM, why not start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art
mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.
AFAIK CUDA offers unified memory, which basically works via the virtual address space, page-faulting data in from main memory. There is also the IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs, drives, and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally network messages could be plumbed straight through to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmapped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp's mmap support talked about a 'fairly new technique'. mmap has been in the UNIX programmer's tool belt for literally decades.