Linux's Stateless H.264 Decode Interface Ready to Be Deemed Stable (phoronix.com)
161 points by mfilion on Nov 17, 2020 | 49 comments


Can someone explain the purpose of this decoder? As far as I know, decoding H.264 is already pretty solid on Linux, and I don't know what the benefits of making it stateless are. I could definitely see why a stateless encoder would be beneficial (i.e. to spread out the load), but isn't decoding H.264 already a solved problem?


Certain hardware does not have accelerated video decoding on mainline Linux - in particular, ARM chips with a Rockchip VPU. The Pinebook Pro is one such device.


This isn't a decoder; it's an interface. The benefits of stateless decoding are simpler hardware and more flexibility in decoding (among others).


The issue is that hardware decoding is a hodgepodge of vendor-specific interfaces. Projects like VA-API try to put a nice frontend onto them so applications don't have to care about the hardware, but ultimately that just means twice the work needs to be done to add support for new hardware and to maintain existing hardware.

If kernel developers writing drivers for hardware video acceleration were to build to a standard set of kernel interfaces, then userspace would just have one interface to work with, simplifying things and reducing overall effort.


Good description.

Furthermore, stateless H.264 decoders are a specific subset of hardware decoders that, as the name implies, don't maintain any state themselves.


Since H.264 has P-frames and B-frames and so on, which I think would be by definition stateful, I assume that processing is just pushed out to the client/userspace somehow?


Stateless refers to the HW interface. Stateless HW can accept decoding jobs in any order as long as all the information is properly provided (parameters extracted and deduced from the bitstream, along with the previously decoded references). As a side effect, it is trivial to multiplex multiple streams on this type of HW.

The V4L2 layer keeps a bit of state (more like caching, to avoid re-uploading too much information for each job). Userspace is responsible for bitstream parsing and DPB management (including re-ordering).
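
To make that concrete, here's a rough sketch (mine, not taken from the article or the kernel docs verbatim) of what submitting one decode job looks like with the stateless H.264 controls plus the media request API, assuming a 5.11+ uAPI header; queue setup, the CAPTURE side and error handling are left out:

    /* Sketch of one stateless decode job using the media request API.
     * video_fd is the V4L2 decoder node, request_fd was allocated with
     * MEDIA_IOC_REQUEST_ALLOC on the media device, and the parameter
     * structs were filled by the userspace bitstream parser.
     */
    #include <sys/ioctl.h>
    #include <linux/media.h>
    #include <linux/videodev2.h>

    int submit_h264_job(int video_fd, int request_fd,
                        struct v4l2_ctrl_h264_sps *sps,
                        struct v4l2_ctrl_h264_pps *pps,
                        struct v4l2_ctrl_h264_decode_params *dec,
                        struct v4l2_buffer *slice_buf)
    {
        /* 1. Attach the per-frame parameters to the request. */
        struct v4l2_ext_control ctrls[] = {
            { .id = V4L2_CID_STATELESS_H264_SPS,
              .size = sizeof(*sps), .ptr = sps },
            { .id = V4L2_CID_STATELESS_H264_PPS,
              .size = sizeof(*pps), .ptr = pps },
            { .id = V4L2_CID_STATELESS_H264_DECODE_PARAMS,
              .size = sizeof(*dec), .ptr = dec },
        };
        struct v4l2_ext_controls set = {
            .which = V4L2_CTRL_WHICH_REQUEST_VAL,
            .request_fd = request_fd,
            .count = 3,
            .controls = ctrls,
        };
        if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &set))
            return -1;

        /* 2. Queue the OUTPUT buffer holding the slice data, tied to
         *    the same request. */
        slice_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
        slice_buf->request_fd = request_fd;
        if (ioctl(video_fd, VIDIOC_QBUF, slice_buf))
            return -1;

        /* 3. Fire the request; the driver programs the HW with all of
         *    the above in one shot. */
        return ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE);
    }

The point being: everything the HW needs for this one frame travels with the job, so the driver doesn't have to remember anything between submissions.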


Oh, so you just provide whatever reference frames, if any, are needed, and it's on you to make sure you've decoded what's necessary first? The difference here basically being that the hardware will not do the "bookkeeping"?


Correct.


Thanks for explaining.


Are there performance implications of needing to upload the entire state needed for a single frame? Or do none of these decoders have such caching anyway, and thus it's just pushing the complex pieces of resource management out to user space, where it arguably belongs?


I don't think anything is uploaded anywhere; you just need more RAM to keep frames around as long as they are necessary. The decoder operates on data in the system's RAM.


Is that generally faster than having dedicated RAM alongside the ASIC? Or are the unit economics not worth it, and unified memory systems are just the current dominant design?


Considering most pixels in the reference frames will be read, on average, less than once per generated frame, it makes no sense to have dedicated RAM.


But each reference frame is on average used for many generated frames, no? I mean that's kinda the point of them, isn't it?


Input is way smaller than output, so memory performance considerations there probably don't even register in the larger scale of things (compared to having to write the decompressed frame to RAM and read it again to scan it out to the display).

Say 4-8 KiB per frame on input leads to a 4 MiB frame on output.
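
For a sense of scale (assuming a 1080p NV12 output buffer, which is what these decoders typically produce):

    1920 x 1080 x 1.0 B (luma)   ~= 2.0 MiB
    1920 x 1080 x 0.5 B (chroma) ~= 1.0 MiB
                                 -----------
                                 ~= 3.0 MiB per decoded frame

So a few KiB of compressed slice data in versus roughly 3-4 MiB of pixels out; the output side dominates by a factor of several hundred.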


I admit I'm not familiar with H.264; I thought the motion vectors and such were applied to the decompressed reference image. At least that's how we implemented the pseudo-MPEG1 encoder/decoder in class.

Not having to decompress the reference frame for every decoded frame seems like a win.


Should be the reverse - these are pulled in behind the interface so you don't need to deal with them in userspace.


I believe this is still an interframe approach, but something regarding the actual underlying code is different from a traditional implementation.

From an information theory perspective, you absolutely must have some way to pass intermediate frames around; otherwise you are just talking about some variant of an intraframe technique and all of the video efficiency losses that would go along with that (i.e. encoding each video frame as a JPEG).


All H.264 (and other codecs') prediction modes are covered. As stated correctly in the previous comment, userspace does the bookkeeping. The kernel is aware of the reference buffers and their attachments (HW-specific buffers). All references needed for predictive decoding (regardless of whether it's B or P) are programmed for each job.

The exception is the FMO and ASO modes, which are rarely supported in HW; even the FFmpeg software decoder didn't bother implementing them.
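
For illustration, programming the references for one job looks roughly like this (my own sketch, assuming the 5.11+ uAPI structures; ref_entry is a made-up userspace bookkeeping type):

    /* Each reference the predictors need gets a DPB slot whose
     * reference_ts matches the timestamp userspace assigned to the
     * CAPTURE buffer holding that frame's pixels.
     */
    #include <linux/videodev2.h>

    struct ref_entry {
        __u64 capture_ts_ns;   /* timestamp of its CAPTURE buffer, in ns */
        __u16 frame_num;
        __s32 top_foc, bottom_foc;
        int   long_term;
    };

    static void fill_dpb(struct v4l2_ctrl_h264_decode_params *dp,
                         const struct ref_entry *refs, unsigned int n)
    {
        for (unsigned int i = 0; i < n && i < V4L2_H264_NUM_DPB_ENTRIES; i++) {
            struct v4l2_h264_dpb_entry *e = &dp->dpb[i];

            e->reference_ts           = refs[i].capture_ts_ns;
            e->frame_num              = refs[i].frame_num;
            e->top_field_order_cnt    = refs[i].top_foc;
            e->bottom_field_order_cnt = refs[i].bottom_foc;
            e->fields                 = V4L2_H264_FRAME_REF;
            e->flags                  = V4L2_H264_DPB_ENTRY_FLAG_VALID |
                                        V4L2_H264_DPB_ENTRY_FLAG_ACTIVE;
            if (refs[i].long_term)
                e->flags |= V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM;
        }
    }

reference_ts is the linking mechanism: the driver looks up the previously decoded CAPTURE buffer by its timestamp instead of keeping its own picture buffer list.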


This is most excellent, and huge thanks to everyone working on this. I tried it recently on a Pinephone connected via a USB-C dock to an external monitor, and it's just awesome what it can do: https://www.youtube.com/watch?v=dHOgVmxH_dA (using gstreamer). Hopefully the support will start trickling down to ffmpeg and common players like mpv, now that the API will be stable.


As a Pinephone owner who would like to be able to charge the phone while watching videos: what do you have to do to get this working on postmarketOS? (Or will I have to recompile everything?)


Just check if pmOS has the latest gstreamer, and then you can try running it with some simple playbin pipeline. Playbin should automatically select the HW decoder.

I used kmssink for testing:

    gst-play-1.0 --use-playbin3 a.mkv --videosink="queue ! kmssink connector-id=52 plane-id=42 plane-properties=s,zpos=2"
(I had to patch the kernel too, because kmssink has trouble selecting a correct plane/connector. But if you don't use kmssink, you should be fine.)

If you want to use it from within some compositor, you'll need to select some other video sink. You'll probably have less trouble if you use a Wayland compositor that supports HW planes properly.

I don't know many players that use gstreamer (is totem still alive?), so it's hard to tell what GUI to use.


IIRC, on many PinePhones, you'll need to break out the soldering iron first [1].

[1]: https://wiki.pine64.org/index.php?title=PinePhone_v1.1_-_Bra...


Anything from pmOS edition onwards is fine.


Sorry to ask a dumb question, but why is this in the kernel instead of user space?


It's a hardware decoder.


The question is valid. Hardware 3D graphics acceleration is driven almost entirely by userspace libraries (libGL) rather than by the kernel. In what way is video different from graphics to justify this change?


3D accel is not driven by userspace libraries. libGL (and others) provide a stable userspace interface which, under the covers, communicates with a kernel driver. Think of it similarly to how libc provides userspace interfaces (like open(), read(), write(), close()) that translate to kernel syscalls to get the work done.

Obviously OpenGL is a lot more complex than the syscall example I gave, but the principle is similar.


> 3D accel is not driven by userspace libraries.

Of course it is. The kernel knows absolutely nothing about the hardware shaders' instruction set or machine code format. The userspace libraries have an entire compiler in them. The kernel just forwards the output of this compiler to the hardware.


Sure, but the point is that there's a necessary kernel component to provide access control, mediate between different users, abstract across different hardware, etc. I don't know exactly what's in this patch set, but in general I'd be surprised to see hardware that didn't require some kind of kernel interface to access it.


So, we appear to be drifting off topic here.

The root-level comment is basically asking why the kernel needs to know anything at all about H.264, rather than just shuffling bytes back and forth between a hardware accelerator and a userspace library (which has a gigantic complex H.264 protocol encoder/decoder in it).

I pointed out that in the 3D acceleration world, the kernel doesn't know anything at all about shader ISAs, and just shuffles bytes back and forth between a hardware 3D accelerator and a userspace library (which has a gigantic complex compiler in it).

I still don't feel the question has been addressed. H.264 is hideously complicated, like shader ISAs. Why don't we move all that hideous complexity out of the kernel for H.264 the way we have for shader ISAs?


A lot of the OpenGL driver seems to be at the userspace level: https://commons.wikimedia.org/wiki/File:Linux_kernel_and_Ope...


This can be really useful for decoding multiple video chat streams.


Indeed, this type of HW allows decoding an unlimited number of streams (past a certain point it won't be real time, but it will still work until you run out of RAM). Also, unlike with some firmware-based decoders, suspended streams don't hold any resources that would prevent other streams from decoding in HW.


Isn't most everything going on in custom decoding silicon now?


Yup, which is exactly why this is in the kernel and not just a userspace library. The kernel is how programs interface with hardware.


This would be a front-end to said silicon.


H.264 decoding in the kernel in 5.11: isn't that too late? My 12 y/o laptop can decode H.264 in hardware; what's the point of adding that to the kernel in 2020?


Each qualifier is important, my dude. 'Stateless' is important here.

Explanation in meme format below

--------------------------------

Scientists: Alien life found!

You: Life found? I am life. I've been life for 30 years. This is not a big deal.


In certain countries, I'd be alien life too.


From what I can tell, this will be able to take advantage of hardware acceleration; for that matter, I'm not sure that software decoding will be supported at all. The novel point here is the stateless part.

Relevant reading: https://www.kernel.org/doc/html/latest/userspace-api/media/v...


I think/hope this will be reused for h265, VP9, and AV1.


VP9 and HEVC (H.265) kernel user APIs exist and are being cleaned up. This takes a lot of time and a lot of testing, so bear with us. At the moment we don't have any AV1-capable silicon with enough documentation that we could write a driver for it. When that happens, we'll definitely get it up and running.

Even though this fairly ancient codec and its existing content decode fine on the CPU, the HW decoder uses less power and is better for battery life. This work mostly enables lower-power SoCs like Allwinner, Rockchip, i.MX8M, RPi4 (HEVC), Mediatek, Microchip, and so on, but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

Also, understand the difference between V4L2 and the GPU accelerators. A GPU uses a command-stream channel, which needs to be centrally managed; that landed in DRM + Mesa, under VA-API. DRM drivers could have been an option, but would have required per-HW userspace in Mesa. VA-API also being a misfit for some of the silicon (Hantro-based) would have made things more complex than needed.


> but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

How do things like Nvidia Nvenc fit in?


> bear with us

Certainly. Very happy to see this work progressing. Hope I can video call on my Pinebook Pro without it burning a hole in my laptop one day, haha.


Thank you for doing this. It burns me up (and my laptop as well!) that so much decoding goes through the least capable hardware.


So that I can enjoy HW decoding on my SBCs that use SoCs for TV boxes (Allwinner H5, H6), on Pinebook Pro, and on Pinephone. Your 12y/o laptop will not do that for me.


Yes, and your 12 y/o laptop can happily chug along with VDPAU or libva.



