Sort of, but it misses some of the larger picture. The main reason fs code is faster in the kernel is the direct access to kernel data structures. File system, virtual memory, and buffer cache are all three sides of the same coin. Once you divorce yourself from direct (even if sandboxed) reads and writes of the underlying data structures, you impose a massive overhead.
Hmm, I don’t think this is the case, at least not when comparing FUSE to in-kernel file systems. Having access to native VM structures only helps to the extent that you can avoid copies, yet in FUSE only one extra copy takes place. I think having to switch tasks (and the associated work: swapping the mm, flushing the TLB, iret-ing/syscalling, synchronization) is really what kills perf.
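To make that extra copy concrete, here is a minimal sketch of a read-only FUSE filesystem, assuming libfuse 3 (the path /hello and the string it serves are made up for illustration). Every read() on the mount is routed through /dev/fuse to this user-space daemon, which fills a buffer that the kernel then copies again on its way back to the reading process; that extra hop plus the task switch is the overhead being debated here.

    /* Minimal read-only FUSE filesystem sketch (libfuse 3). Build with:
       gcc hello_fuse.c $(pkg-config fuse3 --cflags --libs) -o hello_fuse */
    #define FUSE_USE_VERSION 31
    #include <fuse.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/stat.h>

    static const char *hello_path = "/hello";                 /* made-up name */
    static const char *hello_str  = "hello from user space\n";

    static int hello_getattr(const char *path, struct stat *st,
                             struct fuse_file_info *fi)
    {
        (void) fi;
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
            return 0;
        }
        if (strcmp(path, hello_path) == 0) {
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size = strlen(hello_str);
            return 0;
        }
        return -ENOENT;
    }

    static int hello_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY)
            return -EACCES;
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size,
                          off_t offset, struct fuse_file_info *fi)
    {
        (void) fi;
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        size_t len = strlen(hello_str);
        if ((size_t) offset >= len)
            return 0;
        if (offset + size > len)
            size = len - offset;
        /* The daemon fills its own buffer here; the kernel copies it
           again over /dev/fuse to the process that called read(). */
        memcpy(buf, hello_str + offset, size);
        return size;
    }

    static const struct fuse_operations hello_ops = {
        .getattr = hello_getattr,
        .open    = hello_open,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        /* e.g. ./hello_fuse -f /mnt/hello, then: cat /mnt/hello/hello */
        return fuse_main(argc, argv, &hello_ops, NULL);
    }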
Torvalds disagrees with you, at least wrt the fundamental limitation here (there may be other issues layered on top of it, of course).
> No, you need not just the blocks, you need the actual cache chain data structures themselves. For doing things like good read-ahead, you need to be able to (efficiently) look up trivial things like "is that block already in the cache".
> So you need not only the data, you need the _tags_ too.
> In other words, your filesystem needs to have access to the whole disk cache layer, not just the contents. Or it will not perform well.
Edit: and the context of this discussion was fairly ancient systems with tagged TLBs and simple in-order cores, where syscalls were nearly as cheap as regular user-space call instructions. They were still ungodly slow with microkernels, and his explanation is the meat of his view as to why: it's all about having the data in the right place with as little synchronization as possible.
Ah okay, fine-grained control of the cache to minimize IO waiting is a good counterpoint.
I actually have a lot of experience in this area, and I can say that effective readahead is a bit of a crapshoot; it only really works in trivial cases. Ultimately, if IO latency sucks, nothing can save you.
His particular point doesn’t fully make sense either. It’s easy to kick off readahead when you only have access to block data, since the kernel won’t issue redundant IO requests for blocks already in the cache. Also, mlock/madvise give a lot of control over eviction behavior for special blocks.
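As a rough illustration of how much of that control is already exposed to user space on Linux (assuming a pre-existing, non-empty file at a made-up path), here is a sketch that kicks off readahead with posix_fadvise, asks "is that page already in the cache?" with mincore, pins a hot page with mlock, and drops its own mapping of the rest with madvise:

    /* Sketch of user-space page-cache hints on Linux. Assumes
       /tmp/example.dat exists and is non-empty (made-up path). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/tmp/example.dat";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;
        size_t len = st.st_size;

        /* Kick off readahead for the whole file; blocks already in the
           page cache will not be fetched again. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

        void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* "Is that block already in the cache?" -- per-page residency. */
        long pagesz = sysconf(_SC_PAGESIZE);
        size_t npages = (len + pagesz - 1) / pagesz;
        unsigned char *vec = malloc(npages);
        if (vec && mincore(map, len, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident\n", resident, npages);
        }

        /* Pin the first page so eviction leaves it alone... */
        mlock(map, pagesz);
        /* ...and drop this process's mappings of the rest (the page
           cache itself still keeps the data for a shared file mapping). */
        if (npages > 1)
            madvise((char *) map + pagesz, len - pagesz, MADV_DONTNEED);

        free(vec);
        munmap(map, len);
        close(fd);
        return 0;
    }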
All things equal (costless syscall/mm swapping, IO), I still think inter-task synchronization is the largest overhead, but I have no numbers to back it up. Something tells me marshalling all IO syscalls to a kernel thread would be about as slow as a user-space FUSE task.