How to process NvBuffer video data in CUDA

I modified tegra_multimadia_api/samples/00_video_decode and tried to process output video by CUDA.
I cannot get buffer pointer using NvBuffer->map() because it fails.
When I use NvBufferMemMap(), I can get pointer but picture seems not NV12 format even NvBufferParams show as NV12.

Then I used NvBuffer2Raw() it works but when I checked performance, it seems very slow.
using NvBufferMemMap() and use memcpy is 3 times faster.

Using NvBufferMemMap() seems pointing device buffer but format is 8x8 or 16x16 blocked format.
Is there any format reference for device buffer pixel format ?

Hi Nobutaka,
Please refer to tegra_multimedia_api\samples\02_video_dec_cuda

// Create EGLImage from dmabuf fd
    ctx->egl_image = NvEGLImageFromFd(ctx->egl_display, buffer->planes[0].fd);
    if (ctx->egl_image == NULL)
    {
        fprintf(stderr, "Error while mapping dmabuf fd (0x%X) to EGLImage\n",
                 buffer->planes[0].fd);
        return false;
    }

    // Running algo process with EGLImage via GPU multi cores
    HandleEGLImage(&ctx->egl_image);

    // Destroy EGLImage
    NvDestroyEGLImage(ctx->egl_display, ctx->egl_image);
    ctx->egl_image = NULL;

NvBufferMemMap() is to get CPU buffer pointer, not GPU buffer pointer.

I have tried that also.

In NvCudaProc.cpp

if (eglFrame.frameType == CU_EGL_FRAME_TYPE_PITCH)

This type was not CU_EGL_FRAME_TYPE_PITCH.
And I tried Array type Pointer to copy to buffer.
Cpied data was all 0.

Is there anothor way or can you tell picture format of device buffer
which is mapped by NvBufferMemMap().

The best way is directly get device buffer pointer mapped to CUDA memory.
Is threre any way to get that pointer ?

If your eglFrame type is not pitch, please use something similar to below to access.

(cudaArray_t)dst_cudaEGLFrame.frame.pArray[i]

Yes I have accessed (cudaArray_t)dst_cudaEGLFrame.frame.pArray[i]
But I cannot get proper data it was all zero.

Does “NvBuffer2Raw()” source code presenting to the public ?
I want to do NvBuffer2Raw() is doing and I want to process them in CUDA to convert to float RGBA.

Could you share your code?

Please share a patch on 02_video_dec_cuda, we have verified this sample and it should be deviation between the sample and your implementation.

I have already deleted code for using NvEGLImageFromFd()
Is is fast enough than using NvBuffer2Raw() ?

What I want to do is Decode H264 modifying sample code 00_video_decode
and convert decoded picture to float RGBA as fast as I can.
That need to process decoded picture without copying and process by CUDA.
Is there any way to do that ?

Hi,
NvBuffer2Raw() gives you a CPU buffer instead of GPU buffer.

We suggest you call

decoded YUV420 -> NvVideoConverter -> RGBA - NvEGLImageFromFd() -> GPU buffer pointer -> CUDA

You can refer to 04_video_dec_trt and backend
Also you can use NVBufferTransform() to replace NvVideoConverter. The HW engine(VIC) is same for both.

Recreated test code modified 00_video_decode

        // Dequeue a filled buffer
        if (dec->capture_plane.dqBuffer(v4l2_buf, &dec_buffer, NULL, 0))
        {
            if (errno == EAGAIN)
            {
                usleep(1000);
            }
            else
            {
                abort(ctx);
                cerr << "Error while calling dequeue at capture plane" <<
                    endl;
            }
            break;
        }

        EGLImageKHR egl_image;
        // Create EGLImage from dmabuf fd
        egl_image = NvEGLImageFromFd(egl_display, dec_buffer->planes[0].fd);
        if (egl_image == NULL)
        {
            fprintf(stderr, "Error while mapping dmabuf fd (0x%X) to EGLImage\n",
                     dec_buffer->planes[0].fd);
        }
        else
        {
            CUresult status;
            CUeglFrame eglFrame;
            CUgraphicsResource pResource = NULL;

            cudaFree(0);
            status = cuGraphicsEGLRegisterImage(&pResource, egl_image,
                        CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
            if (status != CUDA_SUCCESS)
            {
                printf("cuGraphicsEGLRegisterImage failed: %d, cuda process stop\n",
                                status);
            }
            else
            {
                status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
                if (status != CUDA_SUCCESS)
                {
                    printf("cuGraphicsSubResourceGetMappedArray failed\n");
                }

                status = cuCtxSynchronize();
                if (status != CUDA_SUCCESS)
                {
                    printf("cuCtxSynchronize failed\n");
                }

                if (eglFrame.frameType == CU_EGL_FRAME_TYPE_PITCH)
                {
                    //Rect label in plan Y, you can replace this with any cuda algorithms.

// addLabels((CUdeviceptr) eglFrame.frame.pPitch[0], eglFrame.pitch);
printf(“pitch\n”);
}
else if (eglFrame.frameType == CU_EGL_FRAME_TYPE_ARRAY)
{
static int counter = 0;
printf(“array %dx%d %d %d %d\n”, eglFrame.width, eglFrame.height, eglFrame.planeCount, eglFrame.cuFormat, eglFrame.eglColorFormat);
std::vector vec(eglFrame.width * eglFrame.height);
cudaMemcpy(&vec[0], eglFrame.frame.pArray[0], vec.size(), cudaMemcpyDeviceToHost);
if (++counter == 60)
{
std::ofstream ofs(“test.gray”, std:: ios::out | std::ios::binary);
ofs.write(&vec[0], vec.size());
}
}
else
{
printf(“unknown\n”);
}
status = cuCtxSynchronize();
if (status != CUDA_SUCCESS)
{
printf(“cuCtxSynchronize failed after memcpy\n”);
}

                status = cuGraphicsUnregisterResource(pResource);
                if (status != CUDA_SUCCESS)
                {
                    printf("cuGraphicsEGLUnRegisterResource failed: %d\n", status);
                }
            }
            // Destroy EGLImage
            NvDestroyEGLImage(egl_display, egl_image);
        }

“test.gray” data shows all zero

Thank you I’ll see 04_video_dec_trt

This one may not work, please use cudaCreateTextureObject or SurfaceObject with cudaResourceDesc and then use cudaMemcpy

cudaMemcpy(&vec[0], eglFrame.frame.pArray[0], vec.size(), cudaMemcpyDeviceToHost);

if what I want to do is:
decoded YUV420 → then use cuda to scale Y to half_size, convert to float16 → tensorrt

Can I skip nvvideoconverter, and EGL part, directly get a pointer from decoder output that can be used by Cuda?

Hi guo,

This is not possible because decoder outputs in block linear format. You need to convert to pitch linear format.

what is “block linear format”?

I cannot give more. It is confidential.

If I don’t really care about display, just for AI inference, which approach will give me the best performance?

  1. take 00_video_decode as example, mmap to CPU, do CPU->GPU memory copy to TensorRT engine, inference.
  2. take 04_video_dec_trt as example, go through conv, egl, tensorrt engine

Thanks!

04_video_dec_trt is the sample for this case.

If I follow 04_video_dec_trt, can I avoid using AGBR32 format? Because I don’t need to display the image, I only want use the data to drive inference.

So will it work:

  • NvBufferTransform YUV420 to a smaller image in YUV420 format.
  • EGCImageFromFd to get GPU memory pointer. assume data is still in YUV420 planer format
  • start cuda kernel, convert YUV420 to float16 format. planer, format float16[3][Y][U][V].

Core question is: can EGLImage handle YUV420?

Hi guo,
It should work. Please refer to 02_video_dec_cuda for how to access NvBuffer via CUDA. And below sample is a similar format conversion for your reference.
https://github.com/dusty-nv/jetson-inference/blob/master/util/cuda/cudaYUV-NV12.cu