Convert I420 to RGBA using CUDA kernel to use with jetson-inference

Hello,

I have been using jetson-inference as a base for our application development. The repo also provides many cuda kernels for different video format conversion (which is very helpful, thank you!).

However, I notice that there is one kind of conversion missing, which is I420 -> RGBA. I am wondering if it was now implemented due to some specific reason (like the conversion would be not very efficient in terms of image quality) or no.

I tried to write a kernel myself but it turns out that there are some artifacts in the output.

Here is the kernel:

__device__ void yuv2rgb(uint32_t* yuvi, float *red, float* green, float* blue){
    const float luma = float(yuvi[0]);
    const float u    = float(yuvi[1]) - 128.0f;
    const float v    = float(yuvi[2]) - 128.0f;

    *red   = luma + 1.140f * v;
    *green = luma - 0.395f * u - 0.581f * v;
    *blue  = luma + 2.3032f *u;
}

__global__ void I420ToRGBAf(uint32_t* srcImage,  size_t nSourcePitch,
                           float4* dstImage,     size_t nDestPitch,
                           uint32_t width,       uint32_t height) {
    int x, y;
    uint8_t *srcImageU8     = (uint8_t *)srcImage;

    uint32_t processingPitch = nSourcePitch;

    x = blockIdx.x * blockDim.x + threadIdx.x;
    y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width)
        return; //x = width - 1;

    if (y >= height)
        return; // y = height - 1;

    // This link helps understading of the structure
    // https://en.wikipedia.org/wiki/YUV#Y%E2%80%B2UV420p_(and_Y%E2%80%B2V12_or_YV12)_to_RGB888_conversion
    // Can parallalize/ optimize so that there are 4 rgba conversions 
    // as 4 Y's share a single U and V ?
    uint32_t yuvi[3];
    uint32_t offset = processingPitch * height;

    yuvi[0] = srcImageU8[y * processingPitch + x];
    yuvi[1] = srcImageU8[(y * processingPitch)/4 + x/2 + offset];
    yuvi[2] = srcImageU8[(y * processingPitch)/4 + x/2 + offset + offset/4];

    float r, g, b;

    yuv2rgb(&yuvi[0], &r, &g, &b);

    dstImage[y * width + x] = make_float4(r, g, b, 1.0f);

}

cudaError_t cudaI420ToRGBAf(uint8_t* srcDev, size_t srcPitch, float4* destDev,
                            size_t destPitch, size_t width, size_t height) {
    if( !srcDev || !destDev )
        return cudaErrorInvalidDevicePointer;

    if( srcPitch == 0 || destPitch == 0 || width == 0 || height == 0 )
        return cudaErrorInvalidValue;

    const dim3 blockDim(8,8,1);
    //const dim3 gridDim((width+(2*blockDim.x-1))/(2*blockDim.x), (height+(blockDim.y-1))/blockDim.y, 1);
    const dim3 gridDim(iDivUp(width,blockDim.x), iDivUp(height, blockDim.y), 1);

    I420ToRGBAf<<<gridDim, blockDim>>>( (uint32_t*)srcDev, srcPitch, destDev, destPitch, width, height );

    return CUDA(cudaGetLastError());
}

I have attached two images, one as a result of this kernel and one as it should be expected.

Could anyone help me spot my bug or let me know if there is anyting wrong in my fundamental understanding of I420 format?
expected.png
I420-artifact.png

There wasn’t a reason in particular that I skipped that one, other than at the time I needed to use I420 as an output format rather than an input format.

I noticed there is some integer division in the indexing around line 36/37 of your code snippet, does your kernel work ok on image dimensions with power of two or similar? If so, that would point to that division in the indexing as needing the dimensions to be divisible by 4/8/16/32/ect.

Thanks for the quick response!

I tried to test it on 512x512 feed but still see the same artifact.

Can you post a screenshot of the artifacts you are seeing?

Unfortunately I don’t see the root issue right now, perhaps maybe someone else has identified it.

I have attached the images to my first post. I have attached the image with artifact on 512x512 dimensions here.

It seems to me that there is some issue with the indexing because I see horizontal lines and a faint copy of the right half on the left half. Although, cannot figure out where exactly could be the issue.

Thanks!

Hello,

I added print statements like you have in other kernels and observe that there are some blue pixels with odd values of y axis as negative. I do not handle these explicitly in my code so maybe the viewer code (OpenGL display) is changing it to zero? (This was for input format of 256x256).

cuda thread 216 245  78.860001 9.541003 -64.521599
cuda thread 217 245  78.860001 9.541003 -64.521599
cuda thread 218 245  78.000000 6.960004 -66.521599
cuda thread 219 245  79.000000 7.960004 -65.521599
cuda thread 220 245  76.860001 7.936000 -68.824799
cuda thread 221 245  76.860001 7.936000 -68.824799
cuda thread 222 245  76.860001 7.936000 -68.824799
cuda thread 223 245  76.860001 7.936000 -68.824799
cuda thread 216 246  27.420000 23.441999 17.090401
cuda thread 217 246  26.420000 22.441999 16.090401
cuda thread 218 246  25.420000 21.441999 15.090400
cuda thread 219 246  24.420000 20.441999 14.090400
cuda thread 220 246  24.420000 20.047001 16.393600
cuda thread 221 246  23.420000 19.047001 15.393600
cuda thread 222 246  24.420000 20.047001 16.393600
cuda thread 223 246  23.420000 19.047001 15.393600
cuda thread 216 247  79.000000 7.960004 -65.521599
cuda thread 217 247  80.000000 8.960004 -64.521599
cuda thread 218 247  79.000000 7.960004 -65.521599
cuda thread 219 247  78.000000 6.960004 -66.521599
cuda thread 220 247  76.860001 7.936000 -68.824799
cuda thread 221 247  75.860001 6.936000 -69.824799
cuda thread 222 247  76.860001 7.936000 -68.824799
cuda thread 223 247  76.860001 7.936000 -68.824799
cuda thread 48 252  162.860001 91.960999 28.691200
cuda thread 49 252  162.860001 91.960999 28.691200
cuda thread 50 252  162.720001 93.542000 29.691200
cuda thread 51 252  163.720001 94.542000 30.691200
cuda thread 52 252  162.720001 93.936996 27.388000
cuda thread 53 252  162.720001 93.936996 27.388000
cuda thread 54 252  164.860001 94.355995 28.388000
cuda thread 55 252  164.860001 94.355995 28.388000
cuda thread 48 253  143.339996 111.714005 -18.676001
cuda thread 49 253  143.339996 111.714005 -18.676001
cuda thread 50 253  143.339996 111.714005 -18.676001
cuda thread 51 253  143.339996 111.714005 -18.676001
cuda thread 52 253  144.479996 111.133011 -18.676001
cuda thread 53 253  144.479996 111.133011 -18.676001
cuda thread 54 253  146.759995 109.181000 -14.069600
cuda thread 55 253  147.759995 110.181000 -13.069600
cuda thread 48 254  162.720001 93.147003 31.994400
cuda thread 49 254  163.720001 94.147003 32.994400
cuda thread 50 254  165.000000 92.775002 27.388000
cuda thread 51 254  165.000000 92.775002 27.388000
cuda thread 52 254  163.720001 94.542000 30.691200
cuda thread 53 254  163.720001 94.542000 30.691200
cuda thread 54 254  164.860001 93.960999 30.691200
cuda thread 55 254  164.860001 93.960999 30.691200
cuda thread 48 255  142.199997 111.900002 -16.372801
cuda thread 49 255  142.199997 111.900002 -16.372801
cuda thread 50 255  144.339996 112.714005 -17.676001
cuda thread 51 255  143.339996 111.714005 -18.676001
cuda thread 52 255  145.619995 110.157005 -16.372801
cuda thread 53 255  145.619995 110.157005 -16.372801
cuda thread 54 255  147.759995 109.785995 -10.766400
cuda thread 55 255  147.759995 109.785995 -10.766400

Hmm, yes, there should not be pixels with intensity < 0.0 passed to OpenGL, it can lead to undefined behavior.

Great, I clamped the values between 0.0 and 255.0 but this doses not seem to resolve the issue, unfortunately.

Moreover, I am still trying to figure out if I have understood the byte format correctly. I found another reference https://wiki.videolan.org/YUV/#I420, which explains different formats. But this seems a little different than my initial reference https://en.wikipedia.org/wiki/YUV#Y%E2%80%B2UV420p_(and_Y%E2%80%B2V12_or_YV12)_to_RGB888_conversion, doesn’t it?

If this is the case I might have to change the kernel.

As of now, what appears easiest to me is that I understand the NV12 to RGBA conversion and then modify that because it seems that I420 and NV12 have same information, but they are in different order. Is this understanding correct?

Really appreciate help from anyone!

Thanks.

Hi bhargavK,
You are correct. I420 has 3 planes, which are Y plane, U plane and V plane. NV12 has two planes, which are Y plane and UV plane(U V interleaved). Data of I420 and NV12 is the same.

Thanks for the clarification.