
Unless the prior code was using O_DIRECT, the data was getting loaded into the kernel's page cache, and then the application was copying it into its own anonymous memory. Now the copy isn't happening. There are some subtleties involved [1] but it's not crazy to claim approximately half the RAM usage, even before bringing multiple processes into the picture.

[1] The kernel doesn't necessarily load the whole thing into page cache at once and keep it around indefinitely. It might have been recognizing a sequential loading pattern before and basically discarding pages almost immediately, whereas now it might be keeping them for much longer. Or it might now essentially skip loading the whole thing in at once and do it page-by-page on demand, which could be more RAM-efficient but slower. To some extent, you can control these behaviors with madvise, mlock, MAP_LOCKED, MAP_POPULATE, as well as various sysctls. Also, if it had to page out before, the anonymous memory was "dirty" and thus had to be swapped (written out to disk), whereas the mmap()ed bytes are "clean" and can simply be discarded and (if needed to be paged back in later) reread from the existing file unchanged.
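
To make the contrast concrete, here's a minimal sketch (Linux, C; the "model.bin" filename is hypothetical and error handling is abbreviated) of the read()-into-a-buffer path versus the mmap() path, with the kind of madvise hint mentioned above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);      /* hypothetical file */
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

        /* Approach 1: read(). The kernel fills the page cache, then copies
           into this anonymous buffer -- the data can sit in RAM twice, and
           the buffer is "dirty", so paging it out means writing to swap. */
        char *buf = malloc(st.st_size);
        if (!buf) { perror("malloc"); return 1; }
        for (off_t off = 0; off < st.st_size; ) {
            ssize_t n = read(fd, buf + off, st.st_size - off);
            if (n <= 0) { perror("read"); return 1; }
            off += n;
        }
        free(buf);

        /* Approach 2: mmap(). The page cache pages are mapped directly into
           the address space -- no second copy, and the pages are "clean",
           so under memory pressure they can simply be dropped and reread
           from the file later. */
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(map, st.st_size, MADV_SEQUENTIAL); /* hint: sequential access */
        /* madvise(map, st.st_size, MADV_WILLNEED);   prefault, ~MAP_POPULATE */

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }

Whether mmap actually wins in practice depends on the access pattern, as the footnote says.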



Thanks for the extra clarifications, but the claims were something impossible, like a 23 GB model only using 6 GB with this change. So maybe before this change it would have used a lot more than 23 GB. I was referring to those miracle memory reductions, which are unfortunately not possible. I would like to try 3-bit quantizations when models and software are ready (I found none in my searches today).


Yes, those claims were a bit much, and in fairness jart chimed in to say so too. [1]

fwiw, I'm not an ML person, but it doesn't seem entirely crazy to me to think that SSDs are becoming fast enough that you could avoid keeping a huge model in RAM in some cases. Especially if "computational SSDs" (SSDs that can do some basic first-stage computation without transferring the input data over PCIe) ever become common. (I think some of the ML accelerators for sale today might be approximately this.)

[1] https://news.ycombinator.com/item?id=35393615


Much of performance in computing is about moving data around the memory hierarchy in ways that are inconvenient to programmers.

I made an SSD into a spare swap device, and basically treated my system as having RAM + SSD's worth of RAM. It allowed me to finish a few big jobs (~96 GB of RAM) overnight that wouldn't have finished otherwise.
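
For reference, the same thing can be done programmatically; a minimal sketch (Linux, C, needs root; the device path is hypothetical, and the device must already have been formatted with mkswap) using the swapon(2) syscall:

    #include <stdio.h>
    #include <sys/swap.h>

    int main(void) {
        /* Give the SSD a high priority so it's preferred over other swap. */
        int flags = SWAP_FLAG_PREFER | (10 & SWAP_FLAG_PRIO_MASK);
        if (swapon("/dev/nvme0n1p2", flags) < 0) { /* hypothetical partition */
            perror("swapon");
            return 1;
        }
        return 0;
    }

In practice you'd usually just run the swapon(8) command (which calls the same syscall) rather than write this yourself.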



