Encoding from OpenCV GpuMat and Writing Output to File

Hi All,
I am new to encoding in general, but I am trying to take some manually generated OpenCV cv::cuda::GpuMat NV12M frames, encode them to H265 format, and then write out as a video file. I have been using the jetson_multimedia_api samples as reference, but they all seem to use file input/output. For the time being I want to avoid downloading the GpuMat frames for the sake of speed.

It looks like read_video_frame is where I should be looking for getting the raw data into the buffers, but I’m not sure how to perform the same operations with the CUdeviceptr .data component of the GpuMat rather than a file stream. Besides that, it seems that the file output functionality doesn’t need to change much. Is 03_video_cuda_enc the closest match to what I am trying to accomplish?

If I understand the functionality correctly:

  1. Make the GpuMat frames
  2. Perform any necessary NvBuffer GPU preparations/conversions (?)
  3. Enqueue the frames onto the NvBuffers in the output plane
  4. Dequeue the buffers from the capture plane
  5. Write the encoded frame to the file

If your source data is in cv::cuda::GpuMat, you would need to copy the data from the GpuMat to an NvBufSurface and then send it to the encoder. The copy is done on the GPU, so it should not add much overhead.

Thank you for the quick reply. If it is that simple, then it would certainly be helpful. However, I still do not understand how to use it in conjunction with NvBuffers like how the jetson_multimedia_api samples function.

I found this thread previously which seems to be doing something similar. Are there significant differences between the two methods that I am not understanding?

I should also add that this is JetPack 4.6.1 if it changes anything.

On Jetpack 4.6.1, please use NvBuffer APIs. And this method should work:
Copy OpenCV GpuMat data to an NvBuffer - #9 by sanatmharolkar

An NvBuffer in NV12 has two planes: a Y plane and an interleaved UV plane. You would need to copy data to the two planes individually.
Here is a post about mapping an NV12 NvBuffer to a GpuMat:
Real-time CLAHE processing of video, framerate issue. Gstreamer + nvivafilter + OpenCV - #5 by Honey_Patouceul
In your use-case, you are copying GpuMat to NvBuffer. Please refer to the post to handle alignment.
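
As a rough illustration of the two-plane layout described above, here is a host-side C++ sketch of the size arithmetic. It does not use the NvBuffer API; `PlaneLayout`, `Nv12Layout` and `nv12_layout` are hypothetical names, and the pitches are passed in by the caller on the assumption that the real values are read from the buffer rather than computed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: describes one plane of a pitched NV12 image.
struct PlaneLayout {
    std::size_t width_bytes;  // bytes of real pixel data per row
    std::size_t rows;         // number of rows in the plane
    std::size_t pitch;        // bytes from the start of one row to the next
    std::size_t size_bytes() const { return pitch * rows; }
};

struct Nv12Layout {
    PlaneLayout y;   // full-resolution luma, 1 byte per pixel
    PlaneLayout uv;  // half-height chroma, interleaved U and V bytes
};

// Assumes even width and height. The UV plane holds width/2 (U,V) pairs per
// row, which is again `width` bytes of data, over height/2 rows.
Nv12Layout nv12_layout(std::size_t width, std::size_t height,
                       std::size_t y_pitch, std::size_t uv_pitch) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(y_pitch >= width && uv_pitch >= width);
    Nv12Layout layout;
    layout.y = {width, height, y_pitch};
    layout.uv = {width, height / 2, uv_pitch};
    return layout;
}
```

For a 4504x4504 frame with a 4608-byte pitch, `nv12_layout(4504, 4504, 4608, 4608)` gives a Y plane of 4504 rows and a UV plane of 2252 rows, each row occupying 4608 bytes in memory.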

You can base your implementation on 01_video_encode.

Thank you again for the quick reply. I hadn’t realized that about the planes, but it makes sense and seems simple enough to split with OpenCV. I will try an implementation similar to 01_video_encode and see how it goes. Thank you for the instructions.

You may see: OpenCV CUDA processing from gstreamer pipeline [JP4, JP5]

On Jetpack 4.6.1, please use NvBuffer APIs. And this method should work:
Copy OpenCV GpuMat data to an NvBuffer - #9 by sanatmharolkar

This was helpful with getting the code to work for CPU encoding, thank you.

You may see: OpenCV CUDA processing from gstreamer pipeline [JP4, JP5]

This was helpful in figuring out the process for GPU encoding, thank you. I hadn’t been using your GetPitch function and was getting a video with the right colors but bad alignment until I switched.
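
The "right colors but bad alignment" symptom is what one would expect when a pitched image is copied as if its rows were tightly packed. Below is a minimal host-side sketch of a row-by-row copy that honors both pitches. It is the same idea that `cudaMemcpy2D` applies to device memory. `copy_pitched` is a hypothetical name, not the GetPitch helper from the linked post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `rows` rows of `width_bytes` bytes each. Source and destination advance
// by their own pitch, so padding at the end of each row is skipped, not copied.
void copy_pitched(unsigned char* dst, std::size_t dst_pitch,
                  const unsigned char* src, std::size_t src_pitch,
                  std::size_t width_bytes, std::size_t rows) {
    assert(dst_pitch >= width_bytes && src_pitch >= width_bytes);
    for (std::size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * dst_pitch, src + r * src_pitch, width_bytes);
    }
}
```

A single flat copy of `width_bytes * rows` bytes only matches this result when both pitches equal `width_bytes`. Otherwise each row lands progressively further from where it belongs, which shows up as a sheared image.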

Although it works well now, there doesn’t seem to be much of a speedup compared to the CPU method with cudaMemcpyDefault: approximately 0.2 seconds per 500 frames or so (in total 14.9 seconds for GPU encoding and 15.1 seconds for CPU encoding, including setup). For 40 frames, it is only about 0.1 seconds faster.

I noticed that the time between successive frame writes to the output file did not change. Are there other measures I need to take for the capture plane as well to utilize GPU encoding?

Please run sudo tegrastats to check whether the GPU is fully loaded. If it is, the GPU engine is already delivering its best throughput for this use case.

For capability of hardware encoder, please check

Unless my math is incorrect, it does seem to be the case that we are reaching the maximum throughput for HEVC encoding. For 4504x4504 images that we use, 40 frames done in about 1.2s seems to have a throughput of 676 MP/s which is close to the maximum listed.
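
The arithmetic behind that figure can be checked with a few lines of C++. This is only a sketch of the calculation; `throughput_mp_per_s` is a hypothetical helper name, and the inputs are the numbers quoted in the post.

```cpp
#include <cassert>
#include <cstdint>

// Megapixels per second for `frames` frames of width x height pixels
// processed in `seconds` seconds.
double throughput_mp_per_s(std::uint64_t width, std::uint64_t height,
                           std::uint64_t frames, double seconds) {
    assert(seconds > 0.0);
    const double pixels = static_cast<double>(width * height * frames);
    return pixels / seconds / 1.0e6;
}
```

With 4504 x 4504 pixels per frame, 40 frames and 1.2 s, the total is 811,440,640 pixels, so `throughput_mp_per_s(4504, 4504, 40, 1.2)` comes to roughly 676 MP/s, in line with the estimate above.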

It seems the switch to using GPU encoding just sped up the times to copy into the buffer which helps but might be the extent of things. Thank you for clarifying.

I’ve actually run into another issue in trying to get this to work with standard uchar pointers instead of GpuMats. I had assumed setting eglFrame.frame.pPitch[0] and [1] to the Y and UV planes respectively would work, but the buffer ends up not copying anything.

Is this format/order of function calls with regards to the EGL object handling correct?

cv::cuda::GpuMat d_frame_rgb(4504, 4504, CV_8UC3);
EGLImageKHR eglimage;
eglimage = NvEGLImageFromFd(ctx.eglDisplay, buffer->planes[0].fd);
CUresult status;
CUeglFrame eglFrame;
CUgraphicsResource pResource = NULL;
status = cuGraphicsEGLRegisterImage(&pResource, eglimage, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
if(status != CUDA_SUCCESS)
    cerr << "cuGraphicsEGLRegisterImage failed\n";
status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
if (status != CUDA_SUCCESS)
    cerr << "cuGraphicsResourceGetMappedEglFrame failed\n";

status = cuCtxSynchronize();
if (status != CUDA_SUCCESS)
    cerr << "cuCtxSynchronize failed\n";

uchar* d_frame_y_uchar;
uchar2* d_frame_uv_uchar;
cudaMalloc(&d_frame_y_uchar, 4504*4608*sizeof(uchar));
cudaMalloc(&d_frame_uv_uchar, (4504/2)*(4608/2)*sizeof(uchar2));
eglFrame.frame.pPitch[0] = (void*)d_frame_y_uchar;
eglFrame.frame.pPitch[1] = (void*)d_frame_uv_uchar;
if(d_frame_y_uchar != eglFrame.frame.pPitch[0])
    cerr << "ERROR copying y frame to EGLFRame object\n";
if(d_frame_uv_uchar != eglFrame.frame.pPitch[1])
    cerr << "ERROR copying uv frame to EGLFRame object\n";

convertRGBtoNV12M(d_frame_rgb, d_frame_y_uchar, d_frame_uv_uchar);
read_video_frame(d_frame_y_uchar, d_frame_uv_uchar, *buffer);

status = cuCtxSynchronize();
if (status != CUDA_SUCCESS)
    cerr << "cuCtxSynchronize 2 failed\n";
status = cuGraphicsUnregisterResource(pResource);
if (status != CUDA_SUCCESS)
    cerr << "cuGraphicsUnregisterResource failed\n";
NvDestroyEGLImage(ctx.eglDisplay, eglimage);

int read_video_frame(uchar* yframe, uchar2* uvframe, NvBuffer & buffer)
{
    for(unsigned int i = 0; i < buffer.n_planes; i++){
        NvBuffer::NvBufferPlane &plane = buffer.planes[i];
        // Plane 0 is Y, plane 1 is interleaved UV
        if(i == 0){
            cudaMemcpy(plane.data, yframe, plane.fmt.bytesperpixel * plane.fmt.width * plane.fmt.height, cudaMemcpyDeviceToDevice);
        }
        else{
            cudaMemcpy(plane.data, uvframe, plane.fmt.bytesperpixel * plane.fmt.width * plane.fmt.height, cudaMemcpyDeviceToDevice);
        }
        plane.bytesused = plane.fmt.bytesperpixel * plane.fmt.width * plane.fmt.height;
    }
    return 0;
}

I can get it to work without issues with GpuMats like in OpenCV CUDA processing from gstreamer pipeline [JP4, JP5], but something about the uchar pointers results in the capture plane buffers writing nothing to the file. Without using EGLFrames, using uchar pointers only works, so there is something regarding EGLFrames/EGLImages that I don’t seem to understand yet.

I do not know if this applies to your case, but I once created a uchar4* pointer from a GpuMat (which came from EGL) and then used CUDA kernels to do some processing on my image.

That is what I was doing to copy the data into the buffer, but I wanted to avoid having to use OpenCV if I could. In writing this, I realized my issue and have fixed it. It was actually just a simple mistake involving pointer usage that I overlooked.

uchar* d_frame_y_uchar;
uchar2* d_frame_uv_uchar;
cudaMalloc(&d_frame_y_uchar, height*pitch*sizeof(uchar));
cudaMalloc(&d_frame_uv_uchar, (height/2)*(pitch/2)*sizeof(uchar2));
//Start loop here
cudaMemcpy(eglFrame.frame.pPitch[0], d_frame_y_uchar, height*pitch*sizeof(uchar), cudaMemcpyDeviceToDevice);
cudaMemcpy(eglFrame.frame.pPitch[1], d_frame_uv_uchar, (height/2)*(pitch/2)*sizeof(uchar2), cudaMemcpyDeviceToDevice);
uchar* new_y = (uchar*)eglFrame.frame.pPitch[0];
uchar2* new_uv = (uchar2*)eglFrame.frame.pPitch[1];
//End loop

The cudaMalloc is before the loop to avoid the unnecessary allocation time for each loop iteration.
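
The post does not spell out the mistake, but comparing the two snippets suggests that the earlier version assigned new addresses to `eglFrame.frame.pPitch[]` instead of copying into the memory those pointers already referenced. Here is a host-only analogy of the difference, with a hypothetical `Frame` struct standing in for `CUeglFrame` and `std::memcpy` standing in for `cudaMemcpy`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a mapped frame descriptor: it only holds pointers
// to plane memory that is owned elsewhere.
struct Frame {
    void* plane[2];
};

// Overwrites the pointer stored in the local descriptor. The memory the
// descriptor previously pointed at is never written.
void rebind_plane(Frame& f, int idx, void* data) {
    f.plane[idx] = data;
}

// Leaves the pointer alone and copies bytes into the memory it references.
void copy_into_plane(Frame& f, int idx, const void* data, std::size_t bytes) {
    assert(f.plane[idx] != nullptr);
    std::memcpy(f.plane[idx], data, bytes);
}
```

In this analogy only `copy_into_plane` changes the contents of the underlying buffer. `rebind_plane` passes the earlier "pointer equals what I assigned" checks and still leaves the buffer untouched.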

This looks good, I’ll give it a try in my use case too.
Till now what I was doing is buffer → egl → gpu → uchar4 → preprocessing (cuda kernel, creating black rectangles)

But if I try your method, I do not need to create a GpuMat.

Thanks for sharing!