Linux's Stateless H.264 Decode Interface Ready to Be Deemed Stable (phoronix.com)
161 points by mfilion on Nov 17, 2020 | 49 comments


Can someone explain the purpose of this decoder? As far as I know, decoding H.264 is already pretty solid on Linux, and I don't know what the benefits of making it stateless are. I could definitely see why a stateless encoder would be beneficial (i.e. to spread out the load), but isn't decoding H.264 already a solved problem?


Certain hardware does not have accelerated video decoding on mainline Linux - in particular, ARM chips with a Rockchip VPU. The Pinebook Pro is one such device.


This isn't a decoder; it's an interface. The benefits of stateless decoding are simpler hardware and more flexibility in decoding (among others).


The issue is that hardware decoding is a hodgepodge of vendor-specific interfaces. Projects like VA-API try to put a nice frontend onto them so applications don't have to care about the hardware, but ultimately that just means twice the work needs to be done to add support for new hardware and to maintain existing hardware.

If kernel developers writing drivers for hardware video acceleration were to build to a standard set of kernel interfaces, then userspace would just have one interface to work with, simplifying things and reducing overall effort.


Good description.

Furthermore, stateless H.264 decoders are a specific subset of hardware decoders that, as the name implies, don't maintain any state themselves.


Since H.264 has P-frames and B-frames and so on, which I think would be by definition stateful, I assume that processing is just pushed out to the client/userspace somehow?


Stateless refers to the HW interface. Stateless HW can accept decoding jobs in any order as long as all the information is properly provided (parameters extracted and deduced from the bitstream, along with the previously decoded references). As a side effect, it is trivial to multiplex multiple streams on this type of HW.

The V4L2 layer keeps a bit of state (more like caching, to avoid re-uploading too much information for each job). Userspace is responsible for bitstream parsing and DPB management (including re-ordering).
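
To make that concrete, here's a rough sketch (mine, not taken from the article or the kernel docs verbatim) of what submitting one decode job looks like with the stateless H.264 controls plus the media request API, assuming a 5.11+ uAPI header; queue setup, the CAPTURE side and error handling are left out:

    /* Sketch of one stateless decode job using the media request API.
     * video_fd is the V4L2 decoder node, request_fd was allocated with
     * MEDIA_IOC_REQUEST_ALLOC on the media device, and the parameter
     * structs were filled by the userspace bitstream parser.
     */
    #include <sys/ioctl.h>
    #include <linux/media.h>
    #include <linux/videodev2.h>

    int submit_h264_job(int video_fd, int request_fd,
                        struct v4l2_ctrl_h264_sps *sps,
                        struct v4l2_ctrl_h264_pps *pps,
                        struct v4l2_ctrl_h264_decode_params *dec,
                        struct v4l2_buffer *slice_buf)
    {
        /* 1. Attach the per-frame parameters to the request. */
        struct v4l2_ext_control ctrls[] = {
            { .id = V4L2_CID_STATELESS_H264_SPS,
              .size = sizeof(*sps), .ptr = sps },
            { .id = V4L2_CID_STATELESS_H264_PPS,
              .size = sizeof(*pps), .ptr = pps },
            { .id = V4L2_CID_STATELESS_H264_DECODE_PARAMS,
              .size = sizeof(*dec), .ptr = dec },
        };
        struct v4l2_ext_controls set = {
            .which = V4L2_CTRL_WHICH_REQUEST_VAL,
            .request_fd = request_fd,
            .count = 3,
            .controls = ctrls,
        };
        if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &set))
            return -1;

        /* 2. Queue the OUTPUT buffer holding the slice data, tied to
         *    the same request. */
        slice_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
        slice_buf->request_fd = request_fd;
        if (ioctl(video_fd, VIDIOC_QBUF, slice_buf))
            return -1;

        /* 3. Fire the request; the driver programs the HW with all of
         *    the above in one shot. */
        return ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE);
    }

The point being: everything the HW needs for this one frame travels with the job, so the driver doesn't have to remember anything between submissions.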


Oh, so you just provide whatever reference frames, if any, are needed, and it's on you to make sure you've decoded what's necessary first? The difference here basically being that the hardware will not do the "bookkeeping"?


Correct.


Thanks for explaining.


Are there performance implications of needing to upload the entire state needed for a single frame? Or do none of these decoders have such caching anyway, and thus it's just pushing the complex pieces of resource management out to user space, where it arguably belongs?


I don't think anything is uploaded anywhere; you just need more RAM to keep frames around as long as they are necessary. The decoder operates on data in the system's RAM.


Is that generally faster than having dedicated RAM alongside the ASIC? Or are the unit economics not worth it, and unified memory systems are just the current dominant design?


Considering most pixels in the reference frames will be read, on average, less than once per generated frame, it makes no sense to have dedicated RAM.


But each reference frame is on average used for many generated frames, no? I mean that's kinda the point of them, isn't it?


Input is way smaller than output, so memory performance considerations there probably don't even register in the larger scale of things (compared to having to write the decompressed frame to RAM and read it again to scan it out to the display).

Say 4-8 KiB per frame on input leads to a 4 MiB frame on output.
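
For a sense of scale (assuming a 1080p NV12 output buffer, which is what these decoders typically produce):

    1920 x 1080 x 1.0 B (luma)   ~= 2.0 MiB
    1920 x 1080 x 0.5 B (chroma) ~= 1.0 MiB
                                 -----------
                                 ~= 3.0 MiB per decoded frame

So a few KiB of compressed slice data in versus roughly 3-4 MiB of pixels out; the output side dominates by a factor of several hundred.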


I admit I'm not familiar with H.264; I thought the motion vectors and such were applied to the decompressed reference image. At least that's how we implemented the pseudo-MPEG1 encoder/decoder in class.

Not having to decompress the reference frame for every decoded frame seems like a win.


Should be the reverse - these are pulled in behind the interface so you don't need to deal with them in userspace.


I believe this is still an interframe approach, but something regarding the actual underlying code is different from a traditional implementation.

From an information theory perspective, you absolutely must have some way to pass intermediate frames around; otherwise you are just talking about some variant of an intraframe technique and all of the video efficiency losses that would go along with that (i.e. encoding each video frame as a JPEG).


All H.264 (and other codecs') prediction modes are covered. As stated correctly in the previous comment, userspace does the bookkeeping. The kernel is aware of the reference buffers and their attachments (HW-specific buffers). All references needed for predictive decoding (regardless of whether it's B or P) are programmed for each job.

The exception is the FMO and ASO modes, which are rarely supported in HW; even the FFmpeg software decoder didn't bother implementing them.
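
For illustration, programming the references for one job looks roughly like this (my own sketch, assuming the 5.11+ uAPI structures; ref_entry is a made-up userspace bookkeeping type):

    /* Each reference the predictors need gets a DPB slot whose
     * reference_ts matches the timestamp userspace assigned to the
     * CAPTURE buffer holding that frame's pixels.
     */
    #include <linux/videodev2.h>

    struct ref_entry {
        __u64 capture_ts_ns;   /* timestamp of its CAPTURE buffer, in ns */
        __u16 frame_num;
        __s32 top_foc, bottom_foc;
        int   long_term;
    };

    static void fill_dpb(struct v4l2_ctrl_h264_decode_params *dp,
                         const struct ref_entry *refs, unsigned int n)
    {
        for (unsigned int i = 0; i < n && i < V4L2_H264_NUM_DPB_ENTRIES; i++) {
            struct v4l2_h264_dpb_entry *e = &dp->dpb[i];

            e->reference_ts           = refs[i].capture_ts_ns;
            e->frame_num              = refs[i].frame_num;
            e->top_field_order_cnt    = refs[i].top_foc;
            e->bottom_field_order_cnt = refs[i].bottom_foc;
            e->fields                 = V4L2_H264_FRAME_REF;
            e->flags                  = V4L2_H264_DPB_ENTRY_FLAG_VALID |
                                        V4L2_H264_DPB_ENTRY_FLAG_ACTIVE;
            if (refs[i].long_term)
                e->flags |= V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM;
        }
    }

reference_ts is the linking mechanism: the driver looks up the previously decoded CAPTURE buffer by its timestamp instead of keeping its own picture buffer list.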


This is most excellent, and huge thanks to everyone working on this. I tried it recently on a Pinephone connected via a USB-C dock to an external monitor, and it's just awesome what it can do: https://www.youtube.com/watch?v=dHOgVmxH_dA (using gstreamer). Hopefully the support will start trickling down to ffmpeg and common players like mpv, now that the API will be stable.


As a Pinephone owner who would like to be able to charge the phone while watching videos: what do you have to do to get this working on postmarketOS? (Or will I have to recompile everything?)


Just check if pmOS has the latest gstreamer, and then you can try running it with some simple playbin pipeline. Playbin should automatically select the HW decoder.

I used kmssink for testing:

    gst-play-1.0 --use-playbin3 a.mkv --videosink="queue ! kmssink connector-id=52 plane-id=42 plane-properties=s,zpos=2"
(I had to patch the kernel too, because kmssink has trouble selecting a correct plane/connector. But if you don't use kmssink, you should be fine.)

If you want to use it from within some compositor, you'll need to select some other video sink. You'll probably have less trouble if you use a Wayland compositor that supports HW planes properly.

I don't know many players that use gstreamer (is totem still alive?), so it's hard to tell what GUI to use.


IIRC, on many PinePhones, you'll need to break out the soldering iron first [1].

[1]: https://wiki.pine64.org/index.php?title=PinePhone_v1.1_-_Bra...


Anything from pmOS edition onwards is fine.


Sorry to ask a dumb question, but why is this in the kernel instead of user space?


It's a hardware decoder.


The question is valid. Hardware 3D graphics acceleration is driven almost entirely by userspace libraries (libGL) rather than by the kernel. In what way is video different from graphics to justify this change?


3D accel is not driven by userspace libraries. libGL (and others) provide a stable userspace interface which, under the covers, communicates with a kernel driver. Think of it similarly to how libc provides userspace interfaces (like open(), read(), write(), close()) that translate to kernel syscalls to get the work done.

Obviously OpenGL is a lot more complex than the syscall example I gave, but the principle is similar.


> 3D accel is not driven by userspace libraries.

Of course it is. The kernel knows absolutely nothing about the hardware shaders' instruction set or machine code format. The userspace libraries have an entire compiler in them. The kernel just forwards the output of this compiler to the hardware.


Sure, but the point is that there's a necessary kernel component to provide access control, mediate between different users, abstract across different hardware, etc. I don't know exactly what's in this patch set, but in general I'd be surprised to see hardware that didn't require some kind of kernel interface to access it.


So, we appear to be drifting off topic here.

The root-level comment is basically asking why the kernel needs to know anything at all about H.264, rather than just shuffling bytes back and forth between a hardware accelerator and a userspace library (which has a gigantic complex H.264 protocol encoder/decoder in it).

I pointed out that in the 3D acceleration world, the kernel doesn't know anything at all about shader ISAs, and just shuffles bytes back and forth between a hardware 3D accelerator and a userspace library (which has a gigantic complex compiler in it).

I still don't feel the question has been addressed. H.264 is hideously complicated, like shader ISAs. Why don't we move all that hideous complexity out of the kernel for H.264 the way we have for shader ISAs?


A lot of the OpenGL driver seems to be at the userspace level: https://commons.wikimedia.org/wiki/File:Linux_kernel_and_Ope...


This can be really useful for decoding multiple video chat streams.


Indeed, this type of HW allows decoding an unlimited number of streams (past a certain point it won't be real time, but it will still work until you run out of RAM). Also, unlike with some firmware-based decoders, suspended streams don't hold any resources that would prevent other streams from decoding in HW.


Isn't most everything going on in custom decoding silicon now?


Yup, which is exactly why this is in the kernel and not just a userspace library. The kernel is how programs interface with hardware.


This would be a front-end to said silicon.


H.264 decoding in the kernel in 5.11: isn't that too late? My 12 y/o laptop can decode H.264 in hardware; what's the point of adding that to the kernel in 2020?


Each qualifier is important, my dude. 'Stateless' is important here.

Explanation in meme format below

--------------------------------

Scientists: Alien life found!

You: Life found? I am life. I've been life for 30 years. This is not a big deal.


In certain countries, I'd be alien life too.


From what I can tell, this will be able to take advantage of hardware acceleration; for that matter, I'm not sure that software decoding will be supported at all. The novel point here is the stateless part.

Relevant reading: https://www.kernel.org/doc/html/latest/userspace-api/media/v...


I think/hope this will be reused for h265, VP9, and AV1.


VP9 and HEVC (H.265) kernel user APIs exist and are being cleaned up. This takes a lot of time and a lot of testing, so bear with us. At the moment we don't have any AV1-capable silicon with enough documentation that we could write a driver for it. When that happens, we'll definitely get it up and running.

Even though this fairly ancient codec and its existing content decode fine on the CPU, the HW decoder uses less power and is better for battery life. This work mostly enables lower-power SoCs like Allwinner, Rockchip, i.MX8M, RPi4 (HEVC), Mediatek, Microchip, and so on, but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

Also, understand the difference between V4L2 and the GPU accelerators. A GPU uses a command-stream channel, which needs to be centrally managed; that landed in DRM + Mesa, under VA-API. DRM drivers could have been an option, but would have required per-HW userspace in Mesa. VA-API also being a misfit for some of the silicon (Hantro-based) would have made things more complex than needed.


> but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

How do things like Nvidia Nvenc fit in?


> bear with us

Certainly. Very happy to see this work progressing. Hope I can video call on my Pinebook Pro without it burning a hole in my laptop one day, haha.


Thank you for doing this. It burns me up (and my laptop as well!) that so much decoding goes through the least capable hardware.


So that I can enjoy HW decoding on my SBCs that use SoCs for TV boxes (Allwinner H5, H6), on Pinebook Pro, and on Pinephone. Your 12y/o laptop will not do that for me.


Yes, and your 12 y/o laptop can happily chug along with VDPAU or libva.



