
There are no memory improvements; people were not measuring correctly. The giant improvement is the load time after the first run (if you do not invalidate your caches). Quantization to 4-bit is a big gain, and the loss appears to be minimal from benchmarks. So with quantization you gain the ability to try a bigger model. If you have the hardware to fit the biggest model you can skip it, but most of us need to fit the biggest model possible in our VRAM or RAM.
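
For intuition, a rough sketch of block-wise 4-bit quantization might look like the following. This is a generic scheme for illustration, not llama.cpp's exact Q4 format: each block of 32 floats is stored as one float scale plus 32 packed 4-bit values, roughly 5 bits per weight instead of 32.

    #include <math.h>
    #include <stdint.h>

    #define BLOCK 32

    typedef struct {
        float scale;              /* per-block scale factor           */
        uint8_t q[BLOCK / 2];     /* two 4-bit values packed per byte */
    } block_q4;

    /* Quantize one block of 32 floats to 4-bit values plus a scale. */
    void quantize_block(const float *x, block_q4 *out) {
        float amax = 0.0f;                     /* largest magnitude in block */
        for (int i = 0; i < BLOCK; i++)
            if (fabsf(x[i]) > amax) amax = fabsf(x[i]);

        out->scale = amax / 7.0f;              /* map [-amax, amax] onto [-7, 7] */
        float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;

        for (int i = 0; i < BLOCK; i += 2) {
            int lo = (int)roundf(x[i]     * inv);
            int hi = (int)roundf(x[i + 1] * inv);
            /* store as unsigned nibbles with an offset of 8 */
            out->q[i / 2] = (uint8_t)((lo + 8) | ((hi + 8) << 4));
        }
    }

    /* Recover approximate floats from the packed representation. */
    void dequantize_block(const block_q4 *in, float *x) {
        for (int i = 0; i < BLOCK; i += 2) {
            x[i]     = (float)((in->q[i / 2] & 0x0F) - 8) * in->scale;
            x[i + 1] = (float)((in->q[i / 2] >> 4)  - 8) * in->scale;
        }
    }

The error you eat is the rounding inside each block; because the scale is chosen per block of 32 values rather than per tensor, outliers only hurt their own block, which is why the benchmark loss tends to stay small.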


Unless the prior code was using O_DIRECT, the data was getting loaded into the kernel's page cache, and then the application was copying it into its own anonymous memory. Now the copy isn't happening. There are some subtleties involved [1] but it's not crazy to claim approximately half the RAM usage, even before bringing multiple processes into the picture.

[1] The kernel doesn't necessarily load the whole thing into page cache at once and keep it around indefinitely. It might have been recognizing a sequential loading pattern before and basically discarding pages almost immediately, whereas now it might be keeping them for much longer. Or it might now essentially skip loading the whole thing in at once and do it page-by-page on demand, which could be more RAM-efficient but slower. To some extent, you can control these behaviors with madvise, mlock, MAP_LOCKED, MAP_POPULATE, as well as various sysctls. Also, if it had to page out before, the anonymous memory was "dirty" and thus had to be swapped (written out to disk), whereas the mmap()ed bytes are "clean" and can simply be discarded and (if they need to be paged back in later) reread from the existing file unchanged.
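
A minimal sketch of what those knobs look like from the application side, assuming a hypothetical model file "model.bin" (error handling abbreviated): MAP_POPULATE prefaults the whole mapping up front, madvise(MADV_WILLNEED) is a softer hint, and mlock pins the pages in RAM (subject to RLIMIT_MEMLOCK).

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Read-only, private, file-backed mapping: pages stay "clean" and can
         * be dropped under memory pressure instead of being written to swap. */
        void *p = mmap(NULL, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);  /* the mapping keeps the file contents accessible */

        /* Tell the kernel we expect to touch the whole file soon... */
        madvise(p, st.st_size, MADV_WILLNEED);

        /* ...or pin it so it is never paged out (may fail without privileges). */
        if (mlock(p, st.st_size) != 0)
            perror("mlock (non-fatal)");

        /* ... use the weights at p ... */

        munmap(p, st.st_size);
        return 0;
    }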


Thanks for the extra clarifications, but the claims were something impossible, like a 23 GB model only using 6 GB with this change. So maybe before this change it would have used a lot more than 23 GB. I was referring to those miracle memory reductions, which are unfortunately not possible. I would like to try 3-bit quantizations when models and software are ready (found none in my searches today).


Yes, those claims were a bit much, and in fairness jart chimed in to say so too. [1]

fwiw, I'm not an ML person, but it doesn't seem entirely crazy to me to think that SSDs are becoming fast enough that you could avoid keeping a huge model in RAM in some cases. Especially if "computational SSDs" (SSDs that can do some basic first-stage computation without transferring the input data over PCIe) ever become common. (I think some of the ML accelerators for sale today might be approximately this.)

[1] https://news.ycombinator.com/item?id=35393615


Much of performance in computing is about moving data around the memory hierarchy in ways that are inconvenient to programmers.

I made an SSD into a spare swap device and basically treated my system as having RAM+SSD's worth of RAM. It allowed me to finish a few big jobs (~96 GB of RAM) overnight that wouldn't have finished otherwise.
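
For reference, the usual way to do this is just mkswap/swapon from the shell. If you wanted to do the enabling step from a program, the underlying Linux syscall is swapon(2); the partition path below is hypothetical and must already have been formatted with mkswap.

    #include <stdio.h>
    #include <sys/swap.h>

    int main(void) {
        /* Give the SSD higher priority than any existing swap. Requires root. */
        int flags = SWAP_FLAG_PREFER | (10 << SWAP_FLAG_PRIO_SHIFT);
        if (swapon("/dev/nvme0n1p3", flags) != 0) {   /* hypothetical partition */
            perror("swapon");
            return 1;
        }
        return 0;
    }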


> There are no memory improvements; people were not measuring correctly.

Using file-backed pages instead of anonymous memory is a real improvement because it doesn't have to get swapped out if there's memory pressure. And this program probably isn't the only thing running on the machine.
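
To make the contrast concrete, here is a rough sketch of the two loading strategies (hypothetical "model.bin", error handling omitted): the copy-based loader fills anonymous heap memory, which is dirty and must be written to swap under pressure, while the mmap loader's pages are file-backed and clean, so the kernel can simply drop them and re-read from disk later if needed.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Old style: copy the file into an anonymous (swap-backed) heap buffer.
     * During the load the data exists twice: in the page cache and in buf. */
    void *load_by_copy(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *buf = malloc(st.st_size);
        ssize_t n;
        size_t off = 0;
        while (off < (size_t)st.st_size &&
               (n = read(fd, buf + off, st.st_size - off)) > 0)
            off += (size_t)n;
        close(fd);
        *len = st.st_size;
        return buf;
    }

    /* New style: one read-only, file-backed mapping shared with the page cache. */
    void *load_by_mmap(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        *len = st.st_size;
        return p;
    }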


I was saying that you would not gain any memory; there was no magic compression, so you could not use a bigger model on the same hardware. There were some wild claims made, but that was people measuring memory usage wrong. You are correct, though, that there might be some small memory improvements and some speed improvements.


Well, you can use a bigger model now, it will "just" be really slow. This is different from GPUs, which would simply fail to load models larger than VRAM because they don't support paging (unless you build that yourself).



