Efficient memory integration of CUDA NPP functions and HW NvVideoEncoder (HEVC/H264)

Dear Nvidia community,

I have a PCIe camera feeding video images via zero-copy at 100 fps, 2064x1544, in real time. Since the video stream is a BGGR Bayer pattern, I have to convert the incoming stream into RGB. After spending a lot of time on CUDA OpenCV and ffmpeg, I figured out that the fastest approach is to use the CUDA NPP libraries for image processing.
Afterwards, I would also like to stream the converted image using the HW-accelerated codecs of NvVideoEncoder.
I have the following image processing stages (BGGR->RGB and RGB->YUV420) using NPP:

// Bayer pattern BGGR to RGB conversion using NPP
nppConvStatus = nppiCFAToRGB_8u_C1C3R((Npp8u*)xiImage.bp, 2064, osize, orect, imageRGB_gpu.data, imageRGB_gpu.step,  NPPI_BAYER_BGGR, NPPI_INTER_UNDEFINED);

// RGB to YUV420 conversion using NPP
nppConvStatus = nppiRGBToYUV420_8u_C3P3R(imageRGB_gpu.data,  imageRGB_gpu.step, dstImageYUV420, dstImageStep, dstSize[0]);

Here the three planes of Npp8u *dstImageYUV420[3] are allocated with “cudaMalloc”.
The above code performs the conversions successfully in a few microseconds on a TX2 with JetPack 3.2.
Now I would like to pass the GPU image pointer dstImageYUV420 efficiently to the DMA buffers of the HW video encoder. Could you please provide a code snippet that does this with as little overhead as possible?
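
For reference, the sizes of the three planes involved can be worked out as below. This is only a minimal sketch: it assumes planar YUV 4:2:0 with tightly packed rows (step equal to the plane width), and the struct and function names are made up for illustration. Buffers allocated with a padded pitch will be larger.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte sizes of the Y, U and V planes of a planar
// YUV 4:2:0 frame, assuming tightly packed rows (no row padding).
struct Yuv420PlaneSizes {
    std::size_t y;
    std::size_t u;
    std::size_t v;
    std::size_t total() const { return y + u + v; }
};

inline Yuv420PlaneSizes yuv420PlaneSizes(std::size_t width, std::size_t height)
{
    assert(width > 0 && height > 0);
    // Chroma is subsampled by two in both directions; round up for odd sizes.
    const std::size_t chromaWidth  = (width + 1) / 2;
    const std::size_t chromaHeight = (height + 1) / 2;
    return Yuv420PlaneSizes{width * height,
                            chromaWidth * chromaHeight,
                            chromaWidth * chromaHeight};
}
```

For a 2064x1544 frame this gives a Y plane of 3,186,816 bytes and U and V planes of 796,704 bytes each, 4,780,224 bytes in total per frame.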

Thanks and best regards,
Burak

Hi,

Have you checked our Multimedia API before?
https://developer.nvidia.com/embedded/jetpack

We provide some samples which can give you some information:

Sensor driver API: V4L2 API enables video decode, encode, format conversion and scaling functionality. V4L2 for encode opens up many features like bit rate control, quality presets, low latency encode, temporal tradeoff, motion vector maps, and more.

Thanks.

Dear AastaLLL,

Thanks for your answer! Of course I am aware of MMAPI. Let’s go deeper with a working code example I have so far.

The following code takes the Bayer->RGB->YUV420 image (processed with Nvidia NPP) from a camera render loop, which can run at up to 100 fps at 2064x1544 resolution, and pushes it to the HW video encoder.

void H26xEncoder::pushFrame(uint8_t** framePlanes, int* framePlaneSizes, int Planes)
{
    if (Initialized == true && framePlanes != NULL) {

        // GPU time 1: Measure time for buffer processing + cudaMemcpy
        const int64 start1 = getTickCount();

        struct v4l2_buffer v4l2_buf;
        struct v4l2_plane planes[MAX_PLANES];

        memset(&v4l2_buf, 0, sizeof(v4l2_buf));
        memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

        v4l2_buf.m.planes = planes;

        // Check if we need dqBuffer first
        if (bufferIndex < MAX_ENCODER_FRAMES &&
            ctx.enc->output_plane.getNumQueuedBuffers() <
            ctx.enc->output_plane.getNumBuffers())
        {
            // The queue is not full, no need to dqBuffer
            // Prepare buffer index for the following qBuffer
            printf("bufferIndex: %d\n", bufferIndex);
            v4l2_buf.index = bufferIndex++;

            // Allocate a new pitch-linear YUV420 dma_buf for this slot
            NvBufferCreateParams init_params = {0};
            init_params.width = 2064;
            init_params.height = 1544;
            init_params.layout = NvBufferLayout_Pitch;
            init_params.colorFormat = NvBufferColorFormat_YUV420;

            if (NvBufferCreateEx(&fd, &init_params) == -1) {
                printf("Failed to create dma_buf\n");
            } else {
                printf("fd=%d\n", fd);
            }
        }
        else
        {
            // The queue is full: dequeue a buffer and reuse its fd
            ctx.enc->output_plane.dqBuffer(v4l2_buf, NULL, NULL, 10); // last argument: number of retries
            fd = v4l2_buf.m.planes[0].m.fd;
        }

        NvBufferParams params;
        NvBufferGetParams(fd, &params);

        // Create the EGLImage from the dma_buf fd before registering it with CUDA
        eglImage = NvEGLImageFromFd(display, fd);
        if (eglImage == NULL) {
            cout << "create eglImage failed" << endl;
        }

        status = cuGraphicsEGLRegisterImage(&resource, eglImage, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
        if (status != CUDA_SUCCESS) {
            printf("cuGraphicsEGLRegisterImage failed: %d.\n", status);
        }

        status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, resource, 0, 0);
        if (status != CUDA_SUCCESS) {
            printf("cuGraphicsResourceGetMappedEglFrame failed: %d.\n", status);
        }

        // Copy the Y, U and V planes into the mapped dma_buf (device to device)
        int bytes_to_read;

        bytes_to_read = params.pitch[0] * params.height[0];
        cudaMemcpy((cudaArray_t)eglFrame.frame.pArray[0], framePlanes[0], bytes_to_read, cudaMemcpyDeviceToDevice);

        bytes_to_read = params.pitch[1] * params.height[1];
        cudaMemcpy((cudaArray_t)eglFrame.frame.pArray[1], framePlanes[1], bytes_to_read, cudaMemcpyDeviceToDevice);

        bytes_to_read = params.pitch[2] * params.height[2];
        cudaMemcpy((cudaArray_t)eglFrame.frame.pArray[2], framePlanes[2], bytes_to_read, cudaMemcpyDeviceToDevice);

        // Push the frame into V4L2.
        v4l2_buf.m.planes[0].m.fd = fd;
        v4l2_buf.m.planes[0].bytesused = 1; // bytesused must be non-zero
        ctx.enc->output_plane.qBuffer(v4l2_buf, NULL);

        // Measure time until pushing into buffer
        const double timeSec1 = (getTickCount() - start1) / getTickFrequency();
        cout << "GPU Time 1 : " << timeSec1 * 1000 << " ms" << endl;

        // Measure time for resource unregister function
        const int64 start = getTickCount();

        status = cuGraphicsUnregisterResource(resource);
        if (status != CUDA_SUCCESS)
        {
            printf("cuGraphicsUnregisterResource failed: %d\n", status);
        }

        const double timeSec = (getTickCount() - start) / getTickFrequency();
        cout << "GPU Time 2 : " << timeSec * 1000 << " ms" << endl;

    } // end of if initialized==true

} // end of pushFrame function
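
One detail worth checking in the copies above: each cudaMemcpy moves params.pitch[i] * params.height[i] bytes, which is only correct when the source plane has the same row pitch as the dma_buf plane. The sketch below illustrates on the host what a pitch-aware copy does, row by row; on the GPU, the CUDA runtime call cudaMemcpy2D performs the same kind of operation. The function name copyPlanePitched is hypothetical and not part of any NVIDIA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical host-side illustration of a pitch-aware plane copy.
// Copies `height` rows of `widthBytes` payload bytes each; padding bytes
// at the end of each destination row are left untouched.
inline void copyPlanePitched(std::uint8_t* dst, std::size_t dstPitch,
                             const std::uint8_t* src, std::size_t srcPitch,
                             std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthBytes);
    }
}
```

If the NPP output step and the dma_buf pitch turn out to be identical, the single flat copy per plane used above is equivalent and simpler.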

Copying the processed image with cudaMemcpy and pushing it into the V4L2 buffer takes less than a millisecond, but the “cuGraphicsUnregisterResource” call takes around 15 ms, as shown below.

GPU Time 1 : 0.699259 ms
GPU Time 2 : 15.4409 ms
GPU Time 1 : 0.708219 ms
GPU Time 2 : 16.8548 ms
GPU Time 1 : 0.688091 ms
GPU Time 2 : 15.3245 ms
GPU Time 1 : 0.728859 ms
GPU Time 2 : 14.7805 ms
GPU Time 1 : 0.612476 ms
GPU Time 2 : 17.6632 ms

If I don’t call “cuGraphicsUnregisterResource” at the end, I get memory leaks and sometimes crashes. I would like to reduce this delay so that rendering plus encoding can run at 100 fps.
I would appreciate your help in resolving this issue.
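
To put the timings above in perspective: at 100 fps each frame has a budget of 10 ms, while the two measured stages together take roughly 16 ms. The sketch below only spells out that arithmetic; the function names are illustrative, and it assumes the two stages run strictly one after the other with no overlap.

```cpp
#include <cassert>

// Per-frame time budget in milliseconds for a target frame rate.
inline double frameBudgetMs(double targetFps)
{
    assert(targetFps > 0.0);
    return 1000.0 / targetFps;
}

// Highest sustainable frame rate if the copy and the unregister call
// run back to back (assumption: no overlap between the two stages).
inline double maxSequentialFps(double copyMs, double unregisterMs)
{
    assert(copyMs + unregisterMs > 0.0);
    return 1000.0 / (copyMs + unregisterMs);
}
```

With about 0.7 ms for the copy and 15 to 17 ms for the unregister call, such a sequential loop tops out at around 60 fps, so the unregister step alone exceeds the 10 ms budget.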

Thanks and best regards,
Burak

Hi,

cuGraphicsUnregisterResource is also used in tegra_multimedia_api/samples/common/algorithm/cuda/NvCudaProc.cpp.
Could you check if this issue can be reproduced with our MMAPI samples?

Thanks.

Hi bcizmeci,

Could this issue be reproduced with the MMAPI samples?
Is there any further update you can share?

Thanks

Dear kayccc,

Thanks a lot for considering my problem!
In my application, I have the following video rendering loop. Is there a similar sample that you can point me to?

while (RenderingOn) {
    // Here I get the Bayer pattern video from a PCIe camera
    GetImageFromCam();

    // Perform format conversion using Nvidia NPP as follows
    // Bayer pattern BGGR to RGB conversion using NPP
    nppConvStatus = nppiCFAToRGB_8u_C1C3R((Npp8u*)xiImage.bp, 2064, osize, orect, imageRGB_gpu.data, imageRGB_gpu.step, NPPI_BAYER_BGGR, NPPI_INTER_UNDEFINED);
    // RGB to YUV420 conversion using NPP
    nppConvStatus = nppiRGBToYUV420_8u_C3P3R(imageRGB_gpu.data, imageRGB_gpu.step, dstImageYUV420, dstImageStep, dstSize[0]);

    // Here is the function call
    pushFrame(dstImageYUV420, dstSize, 3);
}

Thanks and best regards,
Burak

Hi,

Not sure if you are looking for a sample for camera -> NPP.
If yes, you can check our VisionWorks sample:
/usr/share/visionworks/sources/samples/opencv_npp_interop/

Thanks.

Hi Nvidia supporters,

I would like to follow up on this issue. As suggested, I can apply CUDA or NPP processing to the video stream of my camera by following the Handle_EGLImage function in NvCudaProc.cpp exactly.
However, Handle_EGLImage adds some delay, which slows down my image processing loop. The cause may be registering and unregistering the resource in every loop iteration, but so far I couldn’t find a solution to this problem.
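
If the per-frame registration is indeed the cost, one common pattern is to register each buffer once and reuse the handle for as long as the buffer lives. The sketch below shows only that caching logic in plain C++. The class name, the callback types and the generic Handle type are assumptions made for illustration: in a real pipeline the register callback would presumably wrap NvEGLImageFromFd and cuGraphicsEGLRegisterImage, and the release callback cuGraphicsUnregisterResource and NvDestroyEGLImage. Whether a long-lived registration behaves correctly with the encoder on a given JetPack release would still need to be verified.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical cache mapping a dma_buf fd to an already registered handle,
// so registration happens once per buffer instead of once per frame.
template <typename Handle>
class RegisteredBufferCache {
public:
    using RegisterFn = std::function<Handle(int fd)>;
    using ReleaseFn  = std::function<void(Handle)>;

    RegisteredBufferCache(RegisterFn reg, ReleaseFn rel)
        : registerFn_(std::move(reg)), releaseFn_(std::move(rel))
    {
        assert(registerFn_ && releaseFn_);
    }

    // Not copyable: the cache owns the registered handles.
    RegisteredBufferCache(const RegisteredBufferCache&) = delete;
    RegisteredBufferCache& operator=(const RegisteredBufferCache&) = delete;

    // Release every cached handle exactly once, at teardown.
    ~RegisteredBufferCache()
    {
        for (auto& entry : handles_) releaseFn_(entry.second);
    }

    // Return the cached handle for fd, registering it on first use.
    Handle get(int fd)
    {
        auto it = handles_.find(fd);
        if (it == handles_.end()) {
            it = handles_.emplace(fd, registerFn_(fd)).first;
        }
        return it->second;
    }

    std::size_t size() const { return handles_.size(); }

private:
    RegisterFn registerFn_;
    ReleaseFn releaseFn_;
    std::unordered_map<int, Handle> handles_;
};
```

With a fixed pool of encoder buffers, the expensive register and unregister calls would then run only a handful of times at startup and shutdown rather than on every frame.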

In the following forum discussion I found another way of passing the memory:
https://devtalk.nvidia.com/default/topic/1020563/jetson-tx1/transfer-video-frames-from-a-pcie-capture-card-to-jetson-tx1-device-memory-for-rt-video-processing-/

They claim that it is also possible to access the DMA buffer using the following code snippet:

NvBufferCreate(&fd, w, h, NvBufferLayout_Pitch, get_nvbuff_color_fmt(ctx->cam_pixfmt));
cudaAddrFromFD(fd, &d_a);

Unfortunately, I couldn’t find any samples or function reference for ‘cudaAddrFromFD’. Could you please provide further information on this function?

Another bottleneck could be the following call, which is used to initialize the EGLImage copy:

display = eglGetDisplay(EGL_DEFAULT_DISPLAY);

In my application, I initialized the above for two cameras as separate class variables, but it seems that the display resource is shared, and this doubles the image processing time. When I switch back to a single camera, the delay is halved.
I also followed the instructions in the following blog post, but in my case I only see one device, so I cannot assign separate display resources to my cameras.
https://devblogs.nvidia.com/egl-eye-opengl-visualization-without-x-server/

I don’t display the camera images at all in the application. I pass them to the HW video encoder through the dma_buf.

I would be grateful if you could address my questions based on your experience.

Best regards,
Burak

Hi,

Before switching to another API, could you give us a sample that reproduces the delay in Handle_EGLImage?
We want to check this internally to see if there is a possible solution.

Thanks.

Dear AastaLLL,

Thanks for your answer! Do you want running sample code that calls Handle_EGLImage? I will try to provide it this week.
However, it would also be helpful if you could tell me how to include and use “cudaAddrFromFD(fd, &d_a)” in my application.

Thanks for your consideration!

Best regards,
Burak

Hi,

Yes, we want a sample that shows the delay in Handle_EGLImage.
We didn’t pay much attention to the delay you mentioned before.
There may still be room for further optimization.

cudaAddrFromFD is a customized function.
Its pipeline should also go through EGL mapping:

Ex.
V4L2_buffer  ->  EGLImageKHR                   ->  CUDA-Array
(dmabuf_fd)      (cuGraphicsEGLRegisterImage)      (pDevPtr)

Check our ‘/home/nvidia/tegra_multimedia_api/samples/backend/v4l2_backend_main.cpp’ for details.
Thanks.