Chroma Subsampling in YUV420_888

Hello,

I receive YUV420_888 (NV12) data from Argus and I want to make sure I convert it correctly to RGBA.

I am performing the conversion in a small CUDA kernel, using the coefficients I found here:
https://devtalk.nvidia.com/default/topic/1047874/jetson-tx1/-solved-argus-convert-yuv420-ycbcr420_888-to-rgb/post/5320893/

The conversion seems to be correct, but I know that there are a number of chroma subsampling schemes (see attached image for an example), and I can’t figure out which one applies here.

I am currently binding a CUDA texture object to each plane of the YUV420_888 image. The filter mode is cudaFilterModeLinear, the read mode is cudaReadModeNormalizedFloat, and the address mode is clamp.

Right now, I am sampling as follows:

const float gx = threadIdx.x + blockIdx.x * blockDim.x;
if (gx >= nDstWidth) return;
const float gy = threadIdx.y + blockIdx.y * blockDim.y;
if (gy >= nDstHeight) return;

auto lumaSample =  tex2D<float>(texLuma, gx + 0.5f, gy + 0.5f);
auto chromaSample = tex2D<float2>(texChroma, ((gx + 0.5f) / 2), ((gy + 0.5f) / 2));

Using the math above:
For pixel (0, 0), I am sampling luma at (0.5, 0.5) and chroma at (0.25, 0.25)
For pixel (1, 0), I am sampling luma at (1.5, 0.5) and chroma at (0.75, 0.25)
For pixel (0, 1), I am sampling luma at (0.5, 1.5) and chroma at (0.25, 0.75)
For pixel (1, 1), I am sampling luma at (1.5, 1.5) and chroma at (0.75, 0.75)

The consequence of this is that I am always blending adjacent chroma samples for every pixel. I’m assuming the chroma is located between the luma samples, and is never co-located with them, so I never use a chroma value directly.
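To make the blending concrete, here is a minimal host-side sketch of what a linear fetch does along one axis. It assumes unnormalized coordinates and clamp addressing, and follows the formula in the CUDA Programming Guide (xB = x - 0.5, i = floor(xB), alpha = frac(xB)). The names LinearTap, linearTap and chromaCoordCentered are hypothetical helpers, not part of any API, and the hardware's fixed-point rounding of alpha is ignored.

```cpp
#include <algorithm>
#include <cmath>

// The two texels a 1D linear fetch reads, and the weight of the second one.
struct LinearTap {
    int   i0;     // first texel index (after clamping)
    int   i1;     // second texel index (after clamping)
    float alpha;  // weight of texel i1; texel i0 gets (1 - alpha)
};

// Emulates a linear-filtered fetch at unnormalized coordinate x on a
// texture of `width` texels with clamp addressing (assumption: this
// matches the texture setup described in the post).
inline LinearTap linearTap(float x, int width) {
    const float xB = x - 0.5f;          // shift so texel centers sit on integers
    const float fl = std::floor(xB);
    const int   i  = static_cast<int>(fl);
    LinearTap t;
    t.i0    = std::min(std::max(i, 0), width - 1);
    t.i1    = std::min(std::max(i + 1, 0), width - 1);
    t.alpha = xB - fl;                  // fractional part in [0, 1)
    return t;
}

// Chroma coordinate used in the post for luma pixel index g.
inline float chromaCoordCentered(int g) {
    return (static_cast<float>(g) + 0.5f) / 2.0f;
}
```

Under these assumptions, luma pixel 2 reads 0.25 * C[0] + 0.75 * C[1] and luma pixel 3 reads 0.75 * C[1] + 0.25 * C[2], so away from the image edge every pixel gets a 3:1 blend of its own chroma sample and the neighbouring one. At pixel 0 both taps clamp to texel 0, so no blending happens there. This weighting matches chroma sited at the center of each 2x2 luma block; whether Argus actually uses that siting is not established here.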

Is this the correct thing to do?
[Attached image: YCBCR420.GIF]

Looking at the NV12toBGRandResize sample in the CUDA samples, it seems that in my example above I should be using the same chroma value (no blending) for all four pixels, as follows:

For pixel (0, 0), sample luma at (0.5, 0.5) and chroma at (0.5, 0.5)
For pixel (1, 0), sample luma at (1.5, 0.5) and chroma at (0.5, 0.5)
For pixel (0, 1), sample luma at (0.5, 1.5) and chroma at (0.5, 0.5)
For pixel (1, 1), sample luma at (1.5, 1.5) and chroma at (0.5, 0.5)

I’m more likely to trust NVIDIA’s example. So, based on that example, I should be doing this:

const int gx = threadIdx.x + blockIdx.x * blockDim.x;
const int gy = threadIdx.y + blockIdx.y * blockDim.y;
if ((gx >= nDstWidth) || (gy >= nDstHeight)) return;

float lumaX = static_cast<float>(gx) + 0.5f;
float lumaY = static_cast<float>(gy) + 0.5f;
float chromaX = static_cast<float>(gx / 2) + 0.5f;
float chromaY = static_cast<float>(gy / 2) + 0.5f;

auto lumaSample =  static_cast<float>(tex2D<uint8_t>(texLuma, lumaX, lumaY));
auto chromaSample = tex2D<uchar2>(texChroma, chromaX, chromaY);

… and I should also be using cudaFilterModePoint and cudaReadModeElementType.
… and maybe compute several points in one go.
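As a rough illustration of the "several points in one go" idea, here is a minimal host-side sketch that walks the chroma plane and pairs each Cb/Cr sample with the four luma samples of its 2x2 block, without blending. It assumes tightly packed planes (real Argus buffers normally have a pitch), even width and height, and NV12 ordering with Cb before Cr. The YCbCr struct and pairNearest function are hypothetical names. In a kernel, the two outer loops would become one thread per chroma sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: one luma sample plus the chroma pair matched to it.
struct YCbCr {
    std::uint8_t y, cb, cr;
};

// Pairs every luma sample with the chroma sample of its 2x2 block.
// luma:   width * height bytes, tightly packed (assumption: no pitch).
// chroma: interleaved Cb,Cr pairs, (width / 2) * (height / 2) of them.
std::vector<YCbCr> pairNearest(const std::vector<std::uint8_t>& luma,
                               const std::vector<std::uint8_t>& chroma,
                               int width, int height) {
    std::vector<YCbCr> out(static_cast<std::size_t>(width) * height);
    const int cw = width / 2;
    for (int cy = 0; cy < height / 2; ++cy) {
        for (int cx = 0; cx < cw; ++cx) {
            // Fetch the chroma pair once for the whole 2x2 block.
            const std::size_t c = 2 * (static_cast<std::size_t>(cy) * cw + cx);
            const std::uint8_t cb = chroma[c];
            const std::uint8_t cr = chroma[c + 1];
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const std::size_t idx =
                        static_cast<std::size_t>(2 * cy + dy) * width + (2 * cx + dx);
                    out[idx] = YCbCr{luma[idx], cb, cr};
                }
            }
        }
    }
    return out;
}
```

Each chroma pair is read once and reused four times, which is the main saving over fetching chroma separately for every output pixel.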

Comments?

Hi,
There is CUDA code for format conversion in
https://github.com/dusty-nv/jetson-utils/tree/798c416c175d509571859c9290257bd5cce1fd63/cuda

Please take a look. You can also call NvBufferTransform() to do the format conversion. It uses the VIC hardware converter.