This is from today, Apr 5, saying the mmap change loads models twice as big with a 100x speedup - is this not a blatant lie?
Wasn’t it discovered last week that the ability to load larger models was a measurement error, and that the speedup came from keeping things in memory after the first load?
Justine knows this and it is stated right there on the page:
> The first time you load a model after rebooting your computer, it's still going to go slow, because it has to load the weights from disk. However each time it's loaded afterwards, it should be fast (at least until memory pressure causes your file cache to be evicted).
"Blatant lie" seems a bit strong. Running a large model for a second time in a row is a pretty common use case and that speedup strikes me as real in that common case. Attribution may have been wrong but the time saved is real.
mmap() will keep things in memory after the first load, but the page cache will _also_ keep things in memory after the first load. The difference is that without mmap, to re-use the cached data you still have to read() the file and store your own copy (requiring 2x memory), instead of just doing a memory access. This has two consequences:
* 2x memory. A 20G data set requires 40G (20G for the page cache and 20G for LLaMA)
* Things would be _even slower_ if they weren't in the page cache after the first load. mmap is fast because it does not require a copy and it reduces the working set size (see the sketch below)
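To make the difference concrete, here is a minimal sketch of the two approaches on a POSIX system. This is not llama.cpp's actual loader, and the file name `model.bin` is hypothetical:

```c
/* Sketch: read()-into-a-buffer vs mmap() for a large weights file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("model.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    size_t size = (size_t)st.st_size;

    /* Approach A: read() into a private buffer.
     * The kernel keeps the file in the page cache AND the process holds
     * its own copy, so a 20G file can occupy ~40G of RAM.
     * (A real loader would loop; one read() may not return everything.) */
    void *buf = malloc(size);
    if (!buf) { perror("malloc"); return 1; }
    if (read(fd, buf, size) < 0) { perror("read"); return 1; }

    /* Approach B: mmap() the file. The mapping is backed directly by
     * the page cache: no second copy, and pages fault in on demand. */
    void *weights = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    printf("read() copy at %p, mmap() view at %p\n", buf, weights);

    munmap(weights, size);
    free(buf);
    close(fd);
    return 0;
}
```

Approach B is, as far as I understand, what the mmap change does at a high level: the mapping is served straight out of the page cache, so there is no second copy, and a warm cache makes subsequent loads fast.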
Was that implementation confirmed to work or was it a bug?
I saw people saying it only loads the weights that are necessary, but all the weights are necessary (unless the architecture is heavily modified). Does it stream the weights, or somehow load only the layers that are currently needed?
Looks like it will be reverted because it introduced some regressions. As far as I understand, there was no magic in the end, but there was a ton of drama around it.
It mmaps the data, so the page holding any particular bit of data is only loaded into memory when something on that page is accessed (at least that is my understanding).
On startup it does load only a fraction of the weights, so RAM usage is low, but as you run inferences on the model it tends to eat up the required amount of RAM. This was from my initial testing.
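That matches how demand paging behaves. As a rough illustration (a sketch assuming Linux and the same hypothetical `model.bin`, not code from llama.cpp), you can watch pages become resident only after they are touched by querying mincore():

```c
/* Sketch: observe demand paging on an mmap'd file with mincore(). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count how many pages of the mapping are currently resident in RAM. */
static size_t resident_pages(void *addr, size_t size, size_t page) {
    size_t npages = (size + page - 1) / page;
    unsigned char *vec = malloc(npages);
    size_t count = 0;
    if (vec && mincore(addr, size, vec) == 0)
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;
    free(vec);
    return count;
}

int main(void) {
    int fd = open("model.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (size_t)st.st_size;
    unsigned char *w = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (w == MAP_FAILED) { perror("mmap"); return 1; }

    printf("resident after mmap:   %zu pages\n", resident_pages(w, size, page));

    /* Touch the first ~1% of the file; only those pages get faulted in. */
    volatile unsigned char sum = 0;
    for (size_t off = 0; off < size / 100; off += page)
        sum += w[off];

    printf("resident after access: %zu pages\n", resident_pages(w, size, page));

    munmap(w, size);
    close(fd);
    return 0;
}
```

The resident count only grows for the pages that were actually touched, which is consistent with RAM usage climbing as inference walks through the weights.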
Please do correct me if I’m wrong.