Hacker News | stormer2000's comments

All H.264 (and other codecs') prediction modes are covered. As correctly stated in the previous comment, userspace does the bookkeeping. The kernel is aware of the reference buffers and their attachments (HW-specific buffers). All references needed for predictive decoding (regardless of whether it's B or P) are programmed for each job.

The exceptions are FMO and ASO modes, which are rarely supported in HW; even FFmpeg's software decoder didn't bother implementing them.


VP9 and HEVC (H.265) kernel user APIs exist and are being cleaned up. This takes a lot of time and a lot of testing, so bear with us. We don't have any silicon with enough specs available that we could write an AV1 driver for at the moment. When that happens, we'll definitely get it up and running.

Even though this most ancient codec and its existing content decode fine on the CPU, the HW decoder uses less power and is better for battery life. This work mostly enables lower-power SoCs like Allwinner, Rockchip, i.MX8M, RPi4 (HEVC), Mediatek, Microchip, and so on, but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

Also, understand the difference between V4L2 and the GPU accelerators. A GPU uses a command-stream channel, which needs to be centrally managed; that landed in DRM + Mesa, under VA-API. DRM drivers could have been an option, but would have required per-HW userspace in Mesa. VA-API also being a misfit for some of the silicon (Hantro-based) would have made things more complex than needed.


> but also higher-capacity chips that can be connected through PCIe to surpass your CPU capacity (Blaize).

How do things like Nvidia Nvenc fit in?


> bear with us

Certainly. Very happy to see this work progressing. Hope I can video call on my Pinebook Pro without it burning a hole in my laptop one day haha.


Thank you for doing this. It burns me up (and my laptop as well!) that so much decoding goes through the least capable hardware.


Indeed, this type of HW allows decoding an unlimited number of streams (past a certain point it won't be real-time, but it will still work until you run out of RAM). Also, unlike with some firmware-based decoders, suspended streams don't hold any resources that would prevent other streams from decoding in HW.


"Stateless" refers to the HW interface. Stateless HW can accept decoding jobs in any order as long as all the information is properly provided (parameters extracted and deduced from the bitstream, along with the previously decoded references). As a side effect, it is trivial to multiplex multiple streams on this type of HW.

The V4L2 layer keeps a bit of the state (more like caching, to avoid re-uploading too much information for each job). Userspace is responsible for bitstream parsing and DPB management (including re-ordering).
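The split described here — userspace tracking the decoded picture buffer (DPB) and handing the hardware an explicit reference list with every job — can be sketched roughly like this. All names are made up for illustration; this is not the real V4L2 API.

```python
# Illustrative sketch of userspace-side bookkeeping for a stateless decoder:
# userspace tracks the DPB and passes an explicit list of reference buffers
# with every decode job. The "hardware" does no bookkeeping of its own.
# Names and structure are hypothetical, NOT the actual V4L2 stateless API.

class Dpb:
    def __init__(self, max_size):
        self.max_size = max_size
        self.frames = []          # (frame_num, buffer) pairs, in decode order

    def add(self, frame_num, buffer):
        self.frames.append((frame_num, buffer))
        # Simple sliding-window eviction of the oldest short-term reference.
        if len(self.frames) > self.max_size:
            self.frames.pop(0)

    def references_for(self, ref_frame_nums):
        """Build the per-job reference list the HW would be programmed with.
        Raises if a needed reference was never decoded (or already evicted),
        since the stateless HW cannot recover it for us."""
        by_num = dict(self.frames)
        missing = [n for n in ref_frame_nums if n not in by_num]
        if missing:
            raise LookupError(f"references not in DPB: {missing}")
        return [by_num[n] for n in ref_frame_nums]

# Decode three frames: I, then P (referencing I), then B (referencing both).
dpb = Dpb(max_size=4)
dpb.add(0, "buf_I")                    # I-frame: no references needed
dpb.add(1, "buf_P")                    # P-frame job referenced frame 0
job_refs = dpb.references_for([0, 1])  # B-frame job: both references
print(job_refs)                        # ['buf_I', 'buf_P']
```

The point of the sketch is the failure mode: if userspace gets the ordering wrong and asks for a reference it never decoded, the lookup fails in userspace — the hardware never had any state to fall back on.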


Oh, so you just provide whatever reference frames are needed, if any, and it's on you to make sure you've decoded them first? The difference here basically being that the hardware will not do the "bookkeeping"?


Correct.


Thanks for explaining.


Are there performance implications of needing to upload the entire state for a single frame? Or do none of these decoders have such caching anyway, so this just pushes the complex resource management out to userspace, where it arguably belongs?


I don't think anything is uploaded anywhere; you just need more RAM to keep frames around for as long as they are needed. The decoder operates on data in system RAM.


Is that generally faster than having dedicated RAM alongside the ASIC? Or are the unit economics not worth it, and unified memory is simply the dominant design now?


Considering most pixels in the reference frames will be read, on average, less than once per generated frame, it makes no sense to have dedicated RAM.


But each reference frame is on average used for many generated frames, no? I mean that's kinda the point of them, isn't it?


The input is far smaller than the output, so memory-performance considerations there probably don't even register in the larger scheme of things (having to write the decompressed frame to RAM and read it again to scan it out to the display).

Say 4-8 KiB per frame on input leads to a 4 MiB frame on output.
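A quick back-of-the-envelope with numbers in that ballpark (the 1080p NV12 frame layout is my assumption, not from the thread):

```python
# Rough arithmetic for the input/output asymmetry of a video decoder.
# Assumption (mine): a 1920x1080 decoded frame stored as NV12,
# i.e. 12 bits per pixel; and ~8 KiB of compressed bitstream per frame.
width, height = 1920, 1080
bits_per_pixel = 12

output_bytes = width * height * bits_per_pixel // 8  # decoded frame in RAM
input_bytes = 8 * 1024                               # compressed input frame

print(output_bytes)                 # 3110400 bytes, i.e. ~3 MiB
print(output_bytes // input_bytes)  # output is ~379x the input size
```

So even if the compressed input had to be re-sent for every job, it would be noise next to the write/read traffic of the decoded frames themselves.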


I admit I'm not familiar with H.264; I thought the motion vectors and such were applied to the decompressed reference image. At least that's how we implemented the pseudo-MPEG-1 encoder/decoder in class.

Not having to decompress the reference frame for every decoded frame seems like a win.
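For the record, the classroom intuition above is the standard scheme: the predictor reads pixels from the already-decompressed reference frame at an offset given by the motion vector, then adds the residual. A toy sketch in that pseudo-MPEG style (purely illustrative — real codecs add sub-pixel interpolation, clamping, multiple reference lists, etc.):

```python
# Toy whole-pixel motion compensation: copy a block from the *decompressed*
# reference frame, displaced by the motion vector, then add the residual.
# Hypothetical classroom-style code, not any real codec's implementation.

def predict_block(reference, bx, by, mvx, mvy, size):
    """Copy a size x size block at (bx, by), displaced by (mvx, mvy)."""
    return [[reference[by + j + mvy][bx + i + mvx]
             for i in range(size)]
            for j in range(size)]

def reconstruct_block(reference, bx, by, mv, residual):
    """Prediction from the reference plus the transmitted residual."""
    size = len(residual)
    pred = predict_block(reference, bx, by, mv[0], mv[1], size)
    return [[pred[j][i] + residual[j][i] for i in range(size)]
            for j in range(size)]

# 4x4 reference frame with distinct pixel values (row*10 + column).
ref = [[r * 10 + c for c in range(4)] for r in range(4)]
# Reconstruct the 2x2 block at (0, 0): motion vector (1, 1) points at the
# reference block [[11, 12], [21, 22]], plus a small residual.
out = reconstruct_block(ref, 0, 0, (1, 1), [[1, 0], [0, 1]])
print(out)  # [[12, 12], [21, 23]]
```

Note the reference stays in RAM as raw pixels, which is exactly why the comment above calls not re-decompressing it a win: only the residual and motion vectors travel in compressed form.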

