I feel significantly dumber for reading that merge request.
The one thing to understand is that the performance implications of mmap are subtle and only work when you have much more RAM than the files you're mapping in.
> only work when you have much more RAM than the files you're mapping in.
Really depends on what you're doing, like memory access patterns. I've definitely seen scenarios where mapping hundreds of gigabytes of data on a machine with dozens of gigabytes of RAM made mmap an almost absurd performance boost over traditional I/O, both immediately and asymptotically, as the most frequently accessed data ends up in cache and the least accessed data is paged out.
I don't disagree with the subtlety part though. It's very difficult to reason about I/O performance in general. Modern systems are like an onion of hidden performance optimization tricks and caching layers (both in software and hardware).
Yeah, and on top of that, different systems (software and hardware combos) behave differently, so I can see the performance of this depending on the implementation of mmap on the system and on the implementation of caches and virtual memory in the architecture. When I've debugged stuff like this, it's either been for myself, in which case I know what combo I'm running on, or for work, where we know which combinations we target and run regression tests to observe the perf implications.
Yes, I have 35 years of experience with UNIX and used to use mmapping with BLAST, a sequence search tool, as well as with my own code.
I'll repeat myself: mmap is subtle. If what you mmap is larger than your host RAM, only some of the pages will be resident at any time, and depending on access patterns, that can lead to significant paging.
I may have parsed your statement incorrectly, but I'm assuming you are talking about the copy of data when using either mmap or file I/O (memcpy versus write). Whether you do file I/O or mmap, there's going to be a copy. With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache; with mmap, the copy occurs in userspace with data being copied into the address space. Swapping can occur with the buffer cache or with mmap, which is why so many databases implement their own buffer cache: to ensure specific data isn't flushed, leaving them in an inconsistent state.
> With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space.
There is no copy with mmap, the page is either unwritable or CoW. There's always a copy with read(). (But read() can still be faster and more memory efficient nevertheless.)
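A small POSIX sketch of the distinction being drawn here (the function name and test file are mine): a store through a MAP_PRIVATE mapping triggers a copy-on-write fault, so the kernel copies the page privately and the file itself is never modified, while read()/pread() always copies from the page cache into a caller-supplied buffer.

```c
/* Sketch of the copy semantics discussed above, assuming POSIX.
 * read()/pread() copies from the page cache into a user buffer;
 * a MAP_PRIVATE mapping is copy-on-write, so a store dirties a
 * private copy of the page and never reaches the file. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write through a private mapping, then report what the file holds. */
int first_byte_after_private_write(const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    p[0] = 0xFF;                 /* CoW fault: kernel copies the page */
    munmap(p, st.st_size);
    unsigned char b;
    pread(fd, &b, 1, 0);         /* read() path: copy into user buffer */
    close(fd);
    return b;                    /* file is unchanged by the private store */
}
```

Until that first store, the private mapping shares the same page-cache pages as everyone else, which is why a read-only mmap involves no copy at all.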
You are right, if you are directly modifying the mmapped region. I always internally model my data as staging my changes to be synchronized to the mmapped region, so that's my mistake there.
> the page is either unwritable or CoW.
This is not universally true, or maybe I'm confused by this statement. MAP_SHARED exists, but maybe you are referencing a specific kernel's implementation of how it achieves coherence between file-backed shared memory regions in two processes? I'm not sure.
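For reference, the MAP_SHARED behavior in question can be demonstrated with a short POSIX sketch (file path and function name are my own, for illustration): a parent and a forked child map the same file with MAP_SHARED, and a store in the child is visible to the parent without any explicit copy, because both mappings point at the same page-cache pages.

```c
/* Sketch of MAP_SHARED coherence, assuming POSIX: a file-backed
 * shared mapping written in a child process is visible to the
 * parent, since both map the same underlying pages. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_write_seen_by_parent(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, 4096) < 0) { close(fd); return -1; }
    unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    pid_t pid = fork();
    if (pid == 0) {              /* child stores through the shared page */
        p[0] = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int seen = p[0];             /* parent observes the child's store */
    munmap(p, 4096);
    return seen;
}
```

With MAP_PRIVATE instead, the child's store would land in its own CoW copy and the parent would still see the original byte.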
> Darwin kernel does though.
Sure, we can always point to a kernel that has implemented some feature or another, which is why I said typically you don't see it.
To be entirely honest, I'm not sure why the kernel doesn't use better routines here; I think on ARM at least it saves the entire NEON state on context switch…