10-bit YCbCr operations in video pipelines


I am building a pipeline with multiple CUDA streams that scales the primary input, runs AI inference on it, and encodes it to HEVC.
The input arrives via GPUDirect RDMA from an FPGA and is 10-bit YCbCr.
We are using a Quadro RTX 4000 GPU under embedded Linux, for which neither NvMedia nor any of the Jetson frameworks are available.

Outside the Jetson frameworks I see very limited support for ordinary media conversions, especially for 10 bits (rather than 8 bits) per channel.
For example, the primary video input arrives as a 4K 10-bit YCbCr stream and has to be upscaled to 8K in good quality, so bilinear scaling is not enough; it needs to be bicubic or even Lanczos.
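To make the scaling requirement concrete, here is a minimal host-side sketch of how the filter taps for a separable Lanczos resize could be computed. The names `lanczos`, `lanczos_taps` and `source_position` are made up for illustration and do not come from any library; the pixel-centre alignment convention is also an assumption. The same arithmetic should carry over into a custom kernel more or less unchanged.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Lanczos window of radius a: sinc(x) * sinc(x / a) for |x| < a, else 0.
inline double lanczos(double x, int a) {
    assert(a > 0);
    if (x == 0.0) return 1.0;
    if (std::fabs(x) >= a) return 0.0;
    const double kPi = 3.14159265358979323846;
    const double px = kPi * x;
    return a * std::sin(px) * std::sin(px / a) / (px * px);
}

// Taps for one output sample: weights for the source indices
// first_index .. first_index + 2*A - 1.
template <int A>
struct Taps {
    int first_index;
    std::array<double, 2 * A> w;
};

// src_pos is the fractional source coordinate of the output sample.
template <int A>
Taps<A> lanczos_taps(double src_pos) {
    Taps<A> t;
    const int base = static_cast<int>(std::floor(src_pos));
    t.first_index = base - A + 1;
    double sum = 0.0;
    for (int i = 0; i < 2 * A; ++i) {
        t.w[i] = lanczos(src_pos - (t.first_index + i), A);
        sum += t.w[i];
    }
    for (double& v : t.w) v /= sum;  // normalise so flat areas stay flat
    return t;
}

// Source coordinate of output pixel x_out for a given scale factor,
// ASSUMING pixel centres sit at integer + 0.5 in both images.
inline double source_position(int x_out, double scale) {
    return (x_out + 0.5) / scale - 0.5;
}
```

For a 4K to 8K upscale the scale factor is 2, so only two distinct sets of taps occur per axis and they could be precomputed. Because the Lanczos window has negative lobes, filtered values can overshoot and would need clamping to 0..1023 before being stored as 10-bit samples; taps that fall outside the image need some border policy (for example clamping the index).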

The only libraries / frameworks I see that are suited to such standard operations are OpenCV and NPP.
OpenCV turned out to be quite slow: its kernel implementations are slow, it likes to convert everything to BGR internally, and it has no 10-bit support, only 8-bit and 16-bit, which is wasteful.
NPP has fast implementations but supports only a few formats, and again only 8-bit and 16-bit.
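One possible workaround, at the cost of the extra memory traffic mentioned above, is to widen the 10-bit samples to 16 bits before handing them to a 16-bit code path. The sketch below is only an illustration: the actual packing delivered by the FPGA is not stated here, so it ASSUMES three 10-bit samples per 32-bit word at bits 0-9, 10-19 and 20-29 with the top two bits unused. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMED packing: three 10-bit samples per 32-bit word,
// at bits 0-9, 10-19 and 20-29; bits 30-31 are ignored.
inline void unpack_word(uint32_t word, uint16_t out[3]) {
    out[0] = static_cast<uint16_t>(word & 0x3FFu);
    out[1] = static_cast<uint16_t>((word >> 10) & 0x3FFu);
    out[2] = static_cast<uint16_t>((word >> 20) & 0x3FFu);
}

// Widen a packed buffer to one 16-bit value per sample (range 0..1023).
inline std::vector<uint16_t> unpack_10bit(const std::vector<uint32_t>& packed) {
    std::vector<uint16_t> out(packed.size() * 3);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        unpack_word(packed[i], &out[i * 3]);
    }
    return out;
}

// Move a 10-bit value into the upper 10 bits of a 16-bit word,
// the way P010-style 16-bit layouts store 10-bit data.
inline uint16_t to_msb_aligned(uint16_t v10) {
    assert(v10 <= 0x3FFu);
    return static_cast<uint16_t>(v10 << 6);
}
```

In a real pipeline this would be a small kernel running on the stream that receives the RDMA buffer, with one thread per 32-bit word. Whether the downstream 16-bit functions expect LSB-aligned or MSB-aligned data would have to be checked per function.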

So what frameworks / libraries are you guys using to do fairly standard media work, e.g. colorspace conversion, cropping, scaling, mixing and blending, all in 10-bit YCbCr? I didn’t find any good frameworks.
Are you all writing it from scratch with custom kernels?
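As an example of the kind of per-pixel operation meant here, a colorspace conversion from 10-bit YCbCr to normalised RGB might look like the sketch below. It ASSUMES BT.709 coefficients and limited ("video") range, i.e. luma codes 64..940 and chroma codes 64..960 centred on 512; the actual stream may use a different matrix or range. `Rgb` and `ycbcr10_to_rgb` are illustrative names, not part of any framework.

```cpp
#include <cassert>
#include <cmath>

struct Rgb {
    double r, g, b;  // nominally 0..1, may fall slightly outside
};

// 10-bit limited-range YCbCr -> normalised RGB.
// ASSUMPTION: BT.709 luma coefficients (Kr = 0.2126, Kb = 0.0722).
inline Rgb ycbcr10_to_rgb(int y, int cb, int cr) {
    assert(y >= 0 && y <= 1023);
    assert(cb >= 0 && cb <= 1023);
    assert(cr >= 0 && cr <= 1023);

    const double kr = 0.2126;
    const double kb = 0.0722;
    const double kg = 1.0 - kr - kb;

    // Remove the limited-range offsets and scale to 0..1 / -0.5..0.5.
    const double yn = (y - 64) / 876.0;
    const double pb = (cb - 512) / 896.0;
    const double pr = (cr - 512) / 896.0;

    Rgb out;
    out.r = yn + 2.0 * (1.0 - kr) * pr;
    out.b = yn + 2.0 * (1.0 - kb) * pb;
    // Green follows from the luma definition Y = Kr*R + Kg*G + Kb*B.
    out.g = (yn - kr * out.r - kb * out.b) / kg;
    return out;
}
```

Out-of-gamut inputs can produce components below 0 or above 1, so a real kernel would clamp before quantising back to integer RGB.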

When writing such compute kernels, do you just run them on the GPU buffer data, or is there any advantage to loading the video frames into textures first (given that bilinear scaling is not an option and higher quality is needed)?

Thanks for any hints…

This is completely outside my area of expertise. Is it possible that the Video Codec SDK is the most suitable software resource for this use case, rather than CUDA? There is a specialized sub-forum for video technologies that you might want to check out: https://forums.developer.nvidia.com/c/professional-graphics-and-rendering/video-technologies/184

The Codec SDK is really just for low-level encoding, where I can enqueue frames to be encoded and receive the encoded frames.

I cannot use its format definitions for other frameworks.
Is there any other framework, or how do you guys support the custom 10-bit YCbCr formats used in video pipelines when feeding CUDA kernels / NPPI functions?