Hacker News | stormer2000's comments

All H.264 (and other codecs') prediction modes are covered. As correctly stated in the previous comment, userspace does the bookkeeping. The kernel is aware of the reference buffers and their attachments (HW-specific buffers). All references needed for predictive decoding (regardless of whether it's B or P) are programmed for each job.

The exceptions are FMO and ASO modes, which are rarely supported in HW; even FFmpeg's software decoder didn't bother implementing them.


VP9 and HEVC (H.265) kernel user APIs exist and are being cleaned up. This takes a lot of time and a lot of testing, so bear with us. We don't have any silicon with enough specs available that we could write an AV1 driver for at the moment. When that happens, we'll definitely get it up and running.

Even though this most ancient codec and its existing content decode fine on the CPU, the HW decoder uses less power and is better for battery life. This work mostly enables lower-power SoCs like Allwinner, Rockchip, i.MX8M, RPi4 (HEVC), Mediatek, Microchip, and so on, but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

Also, understand the difference between V4L2 and the GPU accelerators. A GPU uses a command-stream channel, which needs to be centrally managed; that landed in DRM + Mesa, under VA-API. DRM drivers could have been an option, but would have required per-HW userspace in Mesa. VA-API also being a misfit for some of the silicon (Hantro-based) would have made things more complex than needed.


> but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

How do things like Nvidia Nvenc fit in?


> bear with us

Certainly. Very happy to see this work progressing. Hope I can video call on my Pinebook Pro without it burning a hole in my laptop one day haha.


Thank you for doing this. It burns me up (and my laptop as well!) that so much decoding goes through the least capable hardware.


Indeed, this type of HW allows decoding an unlimited number of streams (past a certain point it won't be real-time, but it will still work until you run out of RAM). Also, unlike with some firmware-based decoders, suspended streams don't hold any resources that would prevent other streams from decoding in HW.


"Stateless" refers to the HW interface. Stateless HW can accept decoding jobs in any order as long as all the information is properly provided (parameters extracted and deduced from the bitstream, along with the previously decoded references). As a side effect, it is trivial to multiplex multiple streams on this type of HW.

The V4L2 layer keeps a bit of the state (more like caching, to avoid re-uploading too much information for each job). Userspace is responsible for bitstream parsing and DPB management (including re-ordering).
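The split described here — userspace tracking the decoded picture buffer (DPB) and handing the hardware an explicit reference list with every job — can be sketched roughly like this. All names are made up for illustration; this is not the real V4L2 API.

```python
# Illustrative sketch of userspace-side bookkeeping for a stateless decoder:
# userspace tracks the DPB and passes an explicit list of reference buffers
# with every decode job. The "hardware" does no bookkeeping of its own.
# Names and structure are hypothetical, NOT the actual V4L2 stateless API.

class Dpb:
    def __init__(self, max_size):
        self.max_size = max_size
        self.frames = []          # (frame_num, buffer) pairs, in decode order

    def add(self, frame_num, buffer):
        self.frames.append((frame_num, buffer))
        # Simple sliding-window eviction of the oldest short-term reference.
        if len(self.frames) > self.max_size:
            self.frames.pop(0)

    def references_for(self, ref_frame_nums):
        """Build the per-job reference list the HW would be programmed with.
        Raises if a needed reference was never decoded (or already evicted),
        since the stateless HW cannot recover it for us."""
        by_num = dict(self.frames)
        missing = [n for n in ref_frame_nums if n not in by_num]
        if missing:
            raise LookupError(f"references not in DPB: {missing}")
        return [by_num[n] for n in ref_frame_nums]

# Decode three frames: I, then P (referencing I), then B (referencing both).
dpb = Dpb(max_size=4)
dpb.add(0, "buf_I")                    # I-frame: no references needed
dpb.add(1, "buf_P")                    # P-frame job referenced frame 0
job_refs = dpb.references_for([0, 1])  # B-frame job: both references
print(job_refs)                        # ['buf_I', 'buf_P']
```

The point of the sketch is the failure mode: if userspace gets the ordering wrong and asks for a reference it never decoded, the lookup fails in userspace — the hardware never had any state to fall back on.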


Oh, so you just provide whatever reference frames are needed, if any, and it's on you to make sure you've decoded them first? The difference here basically being that the hardware will not do the "bookkeeping"?


Correct.


Thanks for explaining.


Are there performance implications of needing to upload the entire state for a single frame? Or do none of these decoders have such caching anyway, so this just pushes the complex resource management out to userspace, where it arguably belongs?


I don't think anything is uploaded anywhere; you just need more RAM to keep frames around for as long as they are needed. The decoder operates on data in system RAM.


Is that generally faster than having dedicated RAM alongside the ASIC? Or are the unit economics not worth it, and unified memory is simply the dominant design now?


Considering most pixels in the reference frames will be read, on average, less than once per generated frame, it makes no sense to have dedicated RAM.


But each reference frame is on average used for many generated frames, no? I mean that's kinda the point of them, isn't it?


The input is far smaller than the output, so memory-performance considerations there probably don't even register in the larger scheme of things (having to write the decompressed frame to RAM and read it again to scan it out to the display).

Say 4-8 KiB per frame on input leads to a 4 MiB frame on output.
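A quick back-of-the-envelope with numbers in that ballpark (the 1080p NV12 frame layout is my assumption, not from the thread):

```python
# Rough arithmetic for the input/output asymmetry of a video decoder.
# Assumption (mine): a 1920x1080 decoded frame stored as NV12,
# i.e. 12 bits per pixel; and ~8 KiB of compressed bitstream per frame.
width, height = 1920, 1080
bits_per_pixel = 12

output_bytes = width * height * bits_per_pixel // 8  # decoded frame in RAM
input_bytes = 8 * 1024                               # compressed input frame

print(output_bytes)                 # 3110400 bytes, i.e. ~3 MiB
print(output_bytes // input_bytes)  # output is ~379x the input size
```

So even if the compressed input had to be re-sent for every job, it would be noise next to the write/read traffic of the decoded frames themselves.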


I admit I'm not familiar with H.264; I thought the motion vectors and such were applied to the decompressed reference image. At least that's how we implemented the pseudo-MPEG-1 encoder/decoder in class.

Not having to decompress the reference frame for every decoded frame seems like a win.
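For the record, the classroom intuition above is the standard scheme: the predictor reads pixels from the already-decompressed reference frame at an offset given by the motion vector, then adds the residual. A toy sketch in that pseudo-MPEG style (purely illustrative — real codecs add sub-pixel interpolation, clamping, multiple reference lists, etc.):

```python
# Toy whole-pixel motion compensation: copy a block from the *decompressed*
# reference frame, displaced by the motion vector, then add the residual.
# Hypothetical classroom-style code, not any real codec's implementation.

def predict_block(reference, bx, by, mvx, mvy, size):
    """Copy a size x size block at (bx, by), displaced by (mvx, mvy)."""
    return [[reference[by + j + mvy][bx + i + mvx]
             for i in range(size)]
            for j in range(size)]

def reconstruct_block(reference, bx, by, mv, residual):
    """Prediction from the reference plus the transmitted residual."""
    size = len(residual)
    pred = predict_block(reference, bx, by, mv[0], mv[1], size)
    return [[pred[j][i] + residual[j][i] for i in range(size)]
            for j in range(size)]

# 4x4 reference frame with distinct pixel values (row*10 + column).
ref = [[r * 10 + c for c in range(4)] for r in range(4)]
# Reconstruct the 2x2 block at (0, 0): motion vector (1, 1) points at the
# reference block [[11, 12], [21, 22]], plus a small residual.
out = reconstruct_block(ref, 0, 0, (1, 1), [[1, 0], [0, 1]])
print(out)  # [[12, 12], [21, 23]]
```

Note the reference stays in RAM as raw pixels, which is exactly why the comment above calls not re-decompressing it a win: only the residual and motion vectors travel in compressed form.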

