How to process NvBuffer video data in CUDA

nobutaka.mitsuma · May 12, 2018, 1:19am

I modified tegra_multimedia_api/samples/00_video_decode and tried to process the output video with CUDA.
I cannot get a buffer pointer using NvBuffer->map() because it fails.
When I use NvBufferMemMap(), I can get a pointer, but the picture does not seem to be in NV12 format even though NvBufferParams reports NV12.

Then I used NvBuffer2Raw(). It works, but when I checked performance it was very slow.
Using NvBufferMemMap() followed by memcpy is 3 times faster.

The pointer from NvBufferMemMap() seems to point to the device buffer, but the data is in an 8x8 or 16x16 blocked format.
Is there any reference for the pixel format of the device buffer?
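
For comparison with the blocked data seen through NvBufferMemMap(), here is a minimal sketch of how a pitch-linear NV12 buffer is usually addressed. It is only an illustration: the `Nv12Layout` struct and the helper names are hypothetical, and it assumes the UV plane follows the Y plane directly and shares its pitch, which a real NvBuffer may not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Describes a pitch-linear NV12 image held in one contiguous buffer.
// Assumption: the UV plane starts right after the Y plane and uses the same
// pitch. Real buffers report their own per-plane offsets and pitches.
struct Nv12Layout {
    std::size_t width;   // luma width in pixels
    std::size_t height;  // luma height in pixels
    std::size_t pitch;   // bytes per row, >= width
};

// Byte offset of the Y sample at pixel (x, y).
inline std::size_t lumaOffset(const Nv12Layout& l, std::size_t x, std::size_t y) {
    assert(x < l.width && y < l.height);
    return y * l.pitch + x;
}

// Byte offset of the U sample covering pixel (x, y); V is the next byte.
// One interleaved UV pair is shared by each 2x2 block of luma pixels.
inline std::size_t chromaOffset(const Nv12Layout& l, std::size_t x, std::size_t y) {
    assert(x < l.width && y < l.height);
    const std::size_t uvPlaneStart = l.pitch * l.height;
    return uvPlaneStart + (y / 2) * l.pitch + (x / 2) * 2;
}

// Total bytes: full-height Y plane plus half-height interleaved UV plane.
inline std::size_t nv12SizeBytes(const Nv12Layout& l) {
    return l.pitch * l.height + l.pitch * ((l.height + 1) / 2);
}
```

If bytes read with these offsets do not form a sensible picture, the mapped buffer is probably not pitch linear.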

DaneLLL · May 14, 2018, 2:03am

Hi Nobutaka,
Please refer to tegra_multimedia_api\samples\02_video_dec_cuda

    // Create EGLImage from dmabuf fd
    ctx->egl_image = NvEGLImageFromFd(ctx->egl_display, buffer->planes[0].fd);
    if (ctx->egl_image == NULL)
    {
        fprintf(stderr, "Error while mapping dmabuf fd (0x%X) to EGLImage\n",
                 buffer->planes[0].fd);
        return false;
    }

    // Running algo process with EGLImage via GPU multi cores
    HandleEGLImage(&ctx->egl_image);

    // Destroy EGLImage
    NvDestroyEGLImage(ctx->egl_display, ctx->egl_image);
    ctx->egl_image = NULL;

NvBufferMemMap() returns a CPU buffer pointer, not a GPU buffer pointer.

nobutaka.mitsuma · May 14, 2018, 5:04am

I have tried that also.

In NvCudaProc.cpp

if (eglFrame.frameType == CU_EGL_FRAME_TYPE_PITCH)

The frame type was not CU_EGL_FRAME_TYPE_PITCH.
So I tried copying to a buffer through the array-type pointer.
The copied data was all 0.

Is there another way, or can you tell me the picture format of the device buffer
that is mapped by NvBufferMemMap()?

The best way would be to get the device buffer pointer mapped directly into CUDA memory.
Is there any way to get that pointer?

WayneWWW · May 14, 2018, 5:58am

If your eglFrame type is not pitch, please access it with something similar to the following.

(cudaArray_t)dst_cudaEGLFrame.frame.pArray[i]

nobutaka.mitsuma · May 14, 2018, 6:06am

Yes, I have accessed (cudaArray_t)dst_cudaEGLFrame.frame.pArray[i],
but I cannot get proper data; it was all zero.

Is the source code of NvBuffer2Raw() available to the public?
I want to do what NvBuffer2Raw() does, and then process the data in CUDA to convert it to float RGBA.

WayneWWW · May 14, 2018, 6:16am

Could you share your code?

DaneLLL · May 14, 2018, 6:26am

Please share a patch on 02_video_dec_cuda. We have verified this sample, so there should be a deviation between the sample and your implementation.

nobutaka.mitsuma · May 14, 2018, 6:31am

I have already deleted the code that used NvEGLImageFromFd().
Is it faster than using NvBuffer2Raw()?

What I want to do is decode H264 by modifying the sample code 00_video_decode
and convert the decoded picture to float RGBA as fast as I can.
That requires processing the decoded picture in CUDA without copying it.
Is there any way to do that?
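
A minimal CPU reference for the NV12 to float RGBA step, which could be used to check a CUDA kernel that does the same per-pixel math. It assumes pitch-linear NV12 input and BT.601 limited-range coefficients. Both are assumptions, and `Float4`, `yuvToRgba`, and `nv12ToRgba` are hypothetical names that are not part of the Multimedia API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float4 { float r, g, b, a; };

// Convert one YUV sample to float RGBA in [0, 1].
// Assumption: BT.601 limited range (Y in 16..235, U/V in 16..240).
inline Float4 yuvToRgba(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const float c = 1.164383f * (static_cast<float>(y) - 16.0f);
    const float d = static_cast<float>(u) - 128.0f;
    const float e = static_cast<float>(v) - 128.0f;
    // Scale from 0..255 to 0..1 and clamp out-of-range results.
    auto norm = [](float x) { return std::min(std::max(x / 255.0f, 0.0f), 1.0f); };
    return { norm(c + 1.596027f * e),
             norm(c - 0.391762f * d - 0.812968f * e),
             norm(c + 2.017232f * d),
             1.0f };
}

// Convert a whole pitch-linear NV12 frame. Both planes are assumed to share
// the same pitch in bytes; the output is tightly packed, row-major RGBA.
std::vector<Float4> nv12ToRgba(const std::uint8_t* yPlane, const std::uint8_t* uvPlane,
                               std::size_t width, std::size_t height, std::size_t pitch) {
    assert(pitch >= width);
    std::vector<Float4> out(width * height);
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            const std::uint8_t y = yPlane[row * pitch + col];
            // Each interleaved UV pair covers a 2x2 block of luma pixels.
            const std::uint8_t* uv = uvPlane + (row / 2) * pitch + (col / 2) * 2;
            out[row * width + col] = yuvToRgba(y, uv[0], uv[1]);
        }
    }
    return out;
}
```

In a CUDA version, the body of the inner loop would become the per-thread work, with `row` and `col` derived from the thread and block indices.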

DaneLLL · May 14, 2018, 7:20am

Hi,
NvBuffer2Raw() gives you a CPU buffer instead of GPU buffer.

We suggest you call

decoded YUV420 -> NvVideoConverter -> RGBA -> NvEGLImageFromFd() -> GPU buffer pointer -> CUDA

You can refer to 04_video_dec_trt and backend.
You can also use NvBufferTransform() in place of NvVideoConverter. Both use the same HW engine (VIC).

nobutaka.mitsuma · May 14, 2018, 7:36am

I recreated the test code by modifying 00_video_decode:

        // Dequeue a filled buffer
        if (dec->capture_plane.dqBuffer(v4l2_buf, &dec_buffer, NULL, 0))
        {
            if (errno == EAGAIN)
            {
                usleep(1000);
            }
            else
            {
                abort(ctx);
                cerr << "Error while calling dequeue at capture plane" <<
                    endl;
            }
            break;
        }

        EGLImageKHR egl_image;
        // Create EGLImage from dmabuf fd
        egl_image = NvEGLImageFromFd(egl_display, dec_buffer->planes[0].fd);
        if (egl_image == NULL)
        {
            fprintf(stderr, "Error while mapping dmabuf fd (0x%X) to EGLImage\n",
                     dec_buffer->planes[0].fd);
        }
        else
        {
            CUresult status;
            CUeglFrame eglFrame;
            CUgraphicsResource pResource = NULL;

            cudaFree(0);
            status = cuGraphicsEGLRegisterImage(&pResource, egl_image,
                        CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
            if (status != CUDA_SUCCESS)
            {
                printf("cuGraphicsEGLRegisterImage failed: %d, cuda process stop\n",
                                status);
            }
            else
            {
                status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
                if (status != CUDA_SUCCESS)
                {
                    printf("cuGraphicsResourceGetMappedEglFrame failed\n");
                }

                status = cuCtxSynchronize();
                if (status != CUDA_SUCCESS)
                {
                    printf("cuCtxSynchronize failed\n");
                }

                if (eglFrame.frameType == CU_EGL_FRAME_TYPE_PITCH)
                {
                    // Rect label in plane Y, you can replace this with any cuda algorithms.
                    // addLabels((CUdeviceptr) eglFrame.frame.pPitch[0], eglFrame.pitch);
                    printf("pitch\n");
                }
                else if (eglFrame.frameType == CU_EGL_FRAME_TYPE_ARRAY)
                {
                    static int counter = 0;
                    printf("array %dx%d %d %d %d\n", eglFrame.width, eglFrame.height, eglFrame.planeCount, eglFrame.cuFormat, eglFrame.eglColorFormat);
                    std::vector<char> vec(eglFrame.width * eglFrame.height);
                    cudaMemcpy(&vec[0], eglFrame.frame.pArray[0], vec.size(), cudaMemcpyDeviceToHost);
                    if (++counter == 60)
                    {
                        std::ofstream ofs("test.gray", std::ios::out | std::ios::binary);
                        ofs.write(&vec[0], vec.size());
                    }
                }
                else
                {
                    printf("unknown\n");
                }
                status = cuCtxSynchronize();
                if (status != CUDA_SUCCESS)
                {
                    printf("cuCtxSynchronize failed after memcpy\n");
                }

                status = cuGraphicsUnregisterResource(pResource);
                if (status != CUDA_SUCCESS)
                {
                    printf("cuGraphicsEGLUnRegisterResource failed: %d\n", status);
                }
            }
            // Destroy EGLImage
            NvDestroyEGLImage(egl_display, egl_image);
        }

The data in “test.gray” is all zero.

nobutaka.mitsuma · May 14, 2018, 7:40am

Thank you. I’ll look at 04_video_dec_trt.

WayneWWW · May 14, 2018, 9:28am

The call below may not work, because eglFrame.frame.pArray[0] is a CUDA array and not a linear device pointer. Please use cudaCreateTextureObject or a surface object with cudaResourceDesc, and then use cudaMemcpy.

cudaMemcpy(&vec[0], eglFrame.frame.pArray[0], vec.size(), cudaMemcpyDeviceToHost);

guo.tang · October 17, 2018, 4:42am

DaneLLL:

Hi,
NvBuffer2Raw() gives you a CPU buffer instead of GPU buffer.

We suggest you call
decoded YUV420 -> NvVideoConverter -> RGBA -> NvEGLImageFromFd() -> GPU buffer pointer -> CUDA
You can refer to 04_video_dec_trt and backend
Also you can use NvBufferTransform() to replace NvVideoConverter. The HW engine(VIC) is same for both.

What I want to do is:
decoded YUV420 → use CUDA to scale Y to half size and convert to float16 → TensorRT

Can I skip NvVideoConverter and the EGL part, and directly get a pointer from the decoder output that can be used by CUDA?
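
The half-size Y step can be sketched on the CPU as a reference for a CUDA kernel. This is only an illustration under stated assumptions: the Y plane is already pitch linear, a 2x2 box average is used as the scaling filter (the post does not say which filter is wanted), and float32 stands in for float16 because standard C++ has no portable half type. `halveLuma` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Downscale a pitch-linear Y plane by 2 in each dimension using a 2x2 box
// average, and normalize the result to [0, 1].
// A trailing odd row or column is dropped.
std::vector<float> halveLuma(const std::uint8_t* yPlane,
                             std::size_t width, std::size_t height, std::size_t pitch) {
    assert(pitch >= width);
    const std::size_t outW = width / 2;
    const std::size_t outH = height / 2;
    std::vector<float> out(outW * outH);
    for (std::size_t row = 0; row < outH; ++row) {
        for (std::size_t col = 0; col < outW; ++col) {
            // Top-left corner of the 2x2 source block for this output pixel.
            const std::uint8_t* top = yPlane + (2 * row) * pitch + 2 * col;
            const std::uint8_t* bottom = top + pitch;
            const float sum = static_cast<float>(top[0]) + static_cast<float>(top[1]) +
                              static_cast<float>(bottom[0]) + static_cast<float>(bottom[1]);
            out[row * outW + col] = sum / (4.0f * 255.0f);
        }
    }
    return out;
}
```

Each output pixel depends only on its own 2x2 source block, so the loop body maps naturally to one CUDA thread per output pixel.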

DaneLLL · October 17, 2018, 4:48am

Hi guo,

This is not possible because decoder outputs in block linear format. You need to convert to pitch linear format.

guo.tang · October 17, 2018, 5:08am

What is “block linear format”?

DaneLLL · October 17, 2018, 5:43am

I cannot give more. It is confidential.

guo.tang · October 17, 2018, 5:06pm

If I don’t really care about display and only need AI inference, which approach will give me the best performance?

1. Take 00_video_decode as an example: mmap to CPU, do a CPU->GPU memory copy to the TensorRT engine, then run inference.
2. Take 04_video_dec_trt as an example: go through the converter, EGL, and the TensorRT engine.

Thanks!

DaneLLL · October 17, 2018, 10:27pm

04_video_dec_trt is the sample for this case.

guo.tang · October 18, 2018, 12:36am

If I follow 04_video_dec_trt, can I avoid using the ABGR32 format? I don’t need to display the image; I only want to use the data to drive inference.

So will this work:

1. NvBufferTransform YUV420 to a smaller image in YUV420 format.
2. NvEGLImageFromFd to get a GPU memory pointer, assuming the data is still in YUV420 planar format.
3. Start a CUDA kernel to convert YUV420 to float16 in planar layout, float16[3][Y][U][V].

The core question is: can EGLImage handle YUV420?
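
The planar float conversion described above could look like the following CPU reference. It is a sketch under assumptions, not the layout any sample is confirmed to use: the input is three tightly packed YUV420 planes, chroma is upsampled by nearest neighbor, the output is laid out as [3][height][width], and float32 stands in for float16. `yuv420ToPlanarFloat` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack planar YUV420 (separate Y, U, V planes, chroma at half resolution)
// into one float buffer laid out as [3][height][width], normalized to [0, 1].
// Assumptions: planes are tightly packed, width and height are even.
std::vector<float> yuv420ToPlanarFloat(const std::uint8_t* y, const std::uint8_t* u,
                                       const std::uint8_t* v,
                                       std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t plane = width * height;
    const std::size_t chromaW = width / 2;
    std::vector<float> out(3 * plane);
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            const std::size_t idx = row * width + col;
            // Nearest-neighbor chroma lookup: one U/V sample per 2x2 block.
            const std::size_t cidx = (row / 2) * chromaW + (col / 2);
            out[idx] = y[idx] / 255.0f;              // channel 0: Y
            out[plane + idx] = u[cidx] / 255.0f;     // channel 1: U
            out[2 * plane + idx] = v[cidx] / 255.0f; // channel 2: V
        }
    }
    return out;
}
```

Whether the network expects normalized YUV or converted RGB channels depends on how it was trained, so the channel contents here are a placeholder.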

DaneLLL · October 18, 2018, 2:24am

Hi guo,
It should work. Please refer to 02_video_dec_cuda for how to access NvBuffer via CUDA. The sample below does a similar format conversion, for your reference.
https://github.com/dusty-nv/jetson-inference/blob/master/util/cuda/cudaYUV-NV12.cu