I'm confused why BPF exists in the first place. Can't we just compile kernel modules that hook into the tracing infrastructure?
It seems like a WebAssembly for the kernel, but local software has the benefit of knowing the platform it is running on. I.e., why compile C code to eBPF when I can just compile to native code directly?
I can potentially see it solving a permissions problem, where you want to give unprivileged users in a multi-tenant setup the ability to run hooks in the kernel. Is that actually a common use case? I don't think it is.
Yes, you can just compile kernel modules, but you take the risk of crashing the kernel. eBPF provides a safe way to interact with the kernel because it is not Turing-complete and imposes additional restrictions. SystemTap is another example of such a language, but it compiles to kernel modules instead.
This is quite important when you want to run this code in production. You don't want to accidentally crash your kernel.
I'm not sure this argument makes sense. Avoiding accidentally crashing the kernel doesn't require a BPF layer.
For instance, you could just write your kernel module in a sufficiently safe language, like Rust, and have the same benefits. You could even pre-compile eBPF for the exact same level of safety. Still no need for the bpf() system call or the eBPF VM or JIT in the kernel.
* Strictly typed -- registers and memory are type checked when the program is loaded. If you use something like Rust, you'd have to bring rustc into the kernel
* Guaranteed to terminate -- you cannot jump backwards, and there is an upper bound on the instruction count
* Bounded memory -- The registers and the memory accessible via maps are a fixed size. There is no growable stack, only a small fixed-size one (512 bytes).
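To make the "guaranteed to terminate" point concrete, here is a toy sketch of the kind of check being described. It is not the kernel's verifier, which also tracks register types and memory bounds; the `Insn` class, the four opcodes, and `verify()` are all made up for illustration. The cap of 4096 matches the kernel's historical `BPF_MAXINSNS` constant. Newer kernels also accept loops they can prove are bounded, which this sketch ignores.

```python
from dataclasses import dataclass

# Illustrative cap. 4096 matches the kernel's historical BPF_MAXINSNS constant.
MAX_INSNS = 4096

# Hypothetical toy opcodes, not the real eBPF instruction set.
KNOWN_OPS = {"mov", "add", "jmp", "exit"}


@dataclass(frozen=True)
class Insn:
    op: str       # one of KNOWN_OPS
    off: int = 0  # relative jump offset, only meaningful for "jmp"


def verify(prog):
    """Return None if the toy program is accepted, else a rejection reason."""
    if not prog:
        return "empty program"
    if len(prog) > MAX_INSNS:
        return "too many instructions"
    if prog[-1].op != "exit":
        return "last instruction must be exit"
    for pc, insn in enumerate(prog):
        if insn.op not in KNOWN_OPS:
            return f"unknown opcode at {pc}"
        if insn.op == "jmp":
            # A jump lands on pc + 1 + off, so a negative offset goes backwards.
            if insn.off < 0:
                return f"backward jump at {pc}"
            if pc + 1 + insn.off >= len(prog):
                return f"jump out of range at {pc}"
    return None


if __name__ == "__main__":
    good = [Insn("mov"), Insn("jmp", 1), Insn("add"), Insn("exit")]
    loop = [Insn("mov"), Insn("jmp", -2), Insn("exit")]
    print(verify(good))
    print(verify(loop))
```

Since every accepted jump moves forward and the last instruction is an exit, the program counter only ever increases, so an accepted program runs at most `len(prog)` steps. That is a linear scan over the bytecode, with no compiler needed in the kernel.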
Compiling Rust to this is possible, but it'd require quite a bit of infrastructure in the kernel to verify that the code is safe, versus the simplicity of eBPF. Early attempts at a general purpose in-kernel VM included passing an AST in, and then doing safety checking on the AST, but they proved too complicated to do safely.
The idea with having eBPF in the kernel is that we can limit the amount of trust given to a particular user-space task.
Accepting compiled stuff in the form of a kernel module requires root privileges and requires that the kernel essentially have complete trust in the code being loaded.
Loading eBPF eliminates the need to trust the process/user doing the loading to that level.
The BPF syscalls don't require CAP_SYS_ADMIN across the board; only specific invocations do. You can set up a socket filter without CAP_SYS_ADMIN, and a device or XDP filter with CAP_NET_ADMIN.
Sure, but how common is that case? How common are multi-tenant Linux systems with untrusted users that give those specific permissions? Do you want untrusted users sniffing the packets of others?
I love Rust, but it's not a panacea. It'll prevent memory errors and type errors in a lot of cases, but those aren't the only ways you can crash a kernel. Logic errors and passing the wrong data over the kernel's interfaces can kill processes, lock up the kernel, or cause it to corrupt data. The eBPF interfaces, by design, don't suffer from these problems because of their restricted nature. They purposefully say there are things you can't compute here, so they don't have to solve the halting problem and various other things!
If a kernel module crashes, you panic the kernel (normally).
eBPF probes can't crash the kernel and are deterministically safe (they aren't actually Turing-complete). So you are unlikely to heavily impact application performance.
BPF was initially added for packet filtering, iirc. Compiling a kernel module for each filtering rule you add would not really work out very well.
Since then, BPF has grown to be used by more subsystems, including tracing, and allows user programs to do advanced (and fast) things. See for example https://github.com/ahupowerdns/secfilter . AFAIK, this doesn't require privileges, which loading a kernel module would.
Having used eBPF/kprobe for work, the main advantage over a precompiled kernel module is convenience. It's much easier to write a C file that hooks a kernel function and reports back to a Python script than it is to build and maintain a kernel module and have that talk to some higher-level code.