Using Cuda filters on cuda::GpuMat obtained from NvBufSurface

• Hardware Platform (Jetson / GPU) Jetson
• DeepStream Version 5.1
• Issue Type( questions, new requirements, bugs) Question

Hi. I am currently developing a custom DeepStream plugin in C++. I would like to obtain frames in the cv::cuda::GpuMat format and run some CUDA operations on them.
My initial code came from sources/gst-plugins/gstdsexample in the SDK. However, that code does not show how to use cv::cuda::GpuMat. That is why I used this code here. But here is the problem:

The code on the forum works with an NvBufSurface named “inter_buf”, an additional surface that holds the output of an “NvBufSurfTransform” applied to the original NvBufSurface. The SDK example uses this transformation for cropping and resizing. I don’t want this. I want to use the original NvBufSurface, without an additional one, and obtain a GpuMat directly from it, so that no extra NvBufSurfTransform is needed.

However, I cannot apply my CUDA filter to the GpuMat obtained this way. I get an illegal memory access error.

Here is the key section of the code:

    if (NvBufSurfaceMapEglImage (input_buf, 0) !=0 ) {
        return GST_FLOW_ERROR;
    }
    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;
    status = cuGraphicsEGLRegisterImage(&pResource,

    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
    status = cuCtxSynchronize();

    cv::cuda::GpuMat d_mat(gpublur->processing_height, gpublur->processing_width, CV_8UC4, eglFrame.frame.pPitch[0]);

    //This line gives the error at runtime. gpublur->filter is just a normal Gaussian blur filter from cudafilters.
    gpublur->filter->apply (d_mat, d_mat);

    status = cuCtxSynchronize();
    status = cuGraphicsUnregisterResource(pResource);

    // Destroy the EGLImage
    NvBufSurfaceUnMapEglImage (input_buf, 0);

This is very similar to the code at the link, except that I have removed the transformations before and after this section (since I am working with the main surface).

And the error I get is:

terminate called after throwing an instance of 'cv::Exception'
  what():  OpenCV(4.5.1) /opt/nvidia/deepstream/deepstream-5.1/opencvcuda/opencv_contrib-4.5.1/modules/cudafilters/src/cuda/row_filter.hpp:172: error: (-217:Gpu API call) an illegal memory access was encountered in function 'caller'

Aborted (core dumped)

Once again, I am using the NvBufSurface directly from the original buffer.
This is how the in_buf was created:

  memset (&in_map_info, 0, sizeof (in_map_info));
  if (!gst_buffer_map (inbuf, &in_map_info, GST_MAP_READ)) {
    g_print ("Error: Failed to map gst buffer\n");
    goto error;
  }

  surface = (NvBufSurface *);

The code is almost identical to the one posted here. Is working with the original NvBufSurface causing the problem? Should I also use another NvBufSurface with two transformations? (one before and one after applying the cuda filter)

Ok. After spending about a day, I figured out the problem. Since others may come across the same issue, I will try to explain it here. Please correct me if you see any misinformation.

When you decide to use cv::cuda::GpuMat, you assume that the initial data is in one of the formats it accepts. However, according to this post, the initial NvBufSurface (the one obtained from the input buffer) uses the NV12 format, not a 4-channel format such as RGBA that a CV_8UC4 GpuMat expects.
Therefore, it seems that we have no option but to go through the transform procedure.
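One way to see the mismatch is to compare how many bytes each format holds. The sketch below is plain C++ with no DeepStream or OpenCV calls; the function names are hypothetical, and it assumes tightly packed rows and even dimensions (real surfaces are usually padded). It only shows that a CV_8UC4 view of a frame spans far more bytes than the NV12 data for the same frame, which may be one reason for the illegal access; the surface layout can matter as well.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of pixel data in a tightly packed NV12 frame: a full-resolution
// 8-bit Y plane followed by a half-resolution interleaved UV plane.
// Assumption: even width and height, no row padding.
std::size_t nv12_bytes(std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + (width * height) / 2;
}

// Bytes spanned by a width x height view with 4 bytes per pixel
// (what a CV_8UC4 matrix covers when rows are tightly packed).
std::size_t rgba_view_bytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}

// True when a 4-channel view of this size would reach past the NV12 data.
bool rgba_view_overruns_nv12(std::size_t width, std::size_t height) {
    return rgba_view_bytes(width, height) > nv12_bytes(width, height);
}
```

For a 1920x1080 frame, nv12_bytes gives 3110400 and rgba_view_bytes gives 8294400, so a filter walking the 4-channel view would read and write well past the end of the NV12 pixel data.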
So the code in the transform section should look like this:

  1. Get NvBufSurface from the original buffer
  2. Have an additional NvBufSurface as your element properties. Make sure to specify NVBUF_COLOR_FORMAT_RGBA
    in the NvBufSurfaceCreateParams
  3. Do a transformation from part 1 to 2 with NvBufSurfTransform. You can add some crop/scale/resize as well.
    You can also just do the conversion for the sake of color format and have same rectangles.
  4. Get the GpuMat as instructed in the above code or the code here.
  5. When you are done with the mat, you can do the reverse format conversion (from RGBA, RGB, etc. back to NV12), as is done at the end of this code. You can reuse the same transformation config; just swap the input and output surfaces and their rectangles (to match their sizes).
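To make the color-format conversion in step 3 more concrete, here is a CPU reference of what an NV12 to RGBA conversion computes per pixel. This is not what NvBufSurfTransform runs internally; it is a plain C++ sketch under stated assumptions: tightly packed planes, even dimensions, and BT.601 limited-range coefficients in fixed point (the hardware may use a different matrix). The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate value to the 8-bit range.
static std::uint8_t clamp_u8(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// CPU reference: tightly packed NV12 in, interleaved RGBA out.
// Assumption: BT.601 limited range, coefficients scaled by 256.
std::vector<std::uint8_t> nv12_to_rgba(const std::vector<std::uint8_t>& nv12,
                                       int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t luma_size = static_cast<std::size_t>(width) * height;
    assert(nv12.size() == luma_size * 3 / 2);

    const std::uint8_t* y_plane = nv12.data();
    const std::uint8_t* uv_plane = nv12.data() + luma_size;  // U,V interleaved
    std::vector<std::uint8_t> rgba(luma_size * 4);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // One U,V pair is shared by each 2x2 block of luma samples.
            const std::uint8_t* uv = uv_plane + (y / 2) * width + (x / 2) * 2;
            const int c = y_plane[y * width + x] - 16;
            const int d = uv[0] - 128;  // U
            const int e = uv[1] - 128;  // V

            const std::size_t o = (static_cast<std::size_t>(y) * width + x) * 4;
            rgba[o + 0] = clamp_u8((298 * c + 409 * e + 128) / 256);
            rgba[o + 1] = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256);
            rgba[o + 2] = clamp_u8((298 * c + 516 * d + 128) / 256);
            rgba[o + 3] = 255;  // opaque alpha
        }
    }
    return rgba;
}
```

The point of the sketch is the layout: NV12 keeps luma and chroma in separate planes at different resolutions, so it cannot be reinterpreted in place as 4-channel pixels and has to be converted into a separate RGBA surface.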

Both transformations run on the GPU for dGPU or on the VIC for Jetson, so I guess they do not add too much overhead.


EDIT: Thanks to Blard.Theophile’s post, you can also use nvvideoconvert for the conversion. This way, the plugin does not have to use NvBufSurfTransform.

Hi Mohammad!
If you want to directly process RGBA data in dsexample you can use nvvideoconvert before dsexample to convert the input buffers from NV12 to RGBA.

Hey there,
I was just about to try this one for today :D

Thank you for your suggestion.

The only drawback is that the buffers will flow as RGBA for the rest of the pipeline, but most DeepStream elements support both NV12 and RGBA anyway (except the nvv4l2 encoders).
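The suggestion above could be sketched as a pipeline fragment. The helper below is hypothetical; it only builds the text of a gst_parse_launch style description that places nvvideoconvert and an RGBA caps filter in front of a custom element, and optionally a second nvvideoconvert back to NV12 after it for downstream elements that need NV12. The element name passed in is whatever your plugin registers, and the caps strings are an assumption about a typical NVMM pipeline, so adjust them to your setup.

```cpp
#include <cassert>
#include <string>

// Build the section of a pipeline description that feeds RGBA buffers to a
// custom element. When back_to_nv12 is true, a second nvvideoconvert is
// appended so that downstream elements receive NV12 again.
std::string rgba_section(const std::string& element, bool back_to_nv12) {
    assert(!element.empty());
    std::string s =
        "nvvideoconvert ! video/x-raw(memory:NVMM),format=RGBA ! " + element;
    if (back_to_nv12) {
        s += " ! nvvideoconvert ! video/x-raw(memory:NVMM),format=NV12";
    }
    return s;
}
```

For example, rgba_section("dsexample", true) yields a fragment that can be spliced between the upstream and downstream parts of a description string. On a shell command line, the caps would need quoting because of the parentheses.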

Oh, I see.
But I think it gives a performance boost, since the conversion sits in a separate plugin and can run in parallel with my plugin’s logic.

Just one last thing, does NV12 have any advantage over the RGBA? I mean, I can convert it back to NV12 with another nvvideoconvert after the plugin, if it does.

Thank you for your detailed answers🙏🏻.

I’m not aware of any significant advantage of NV12 over RGBA. I’d say it depends on your pipeline and the capabilities of your elements. Maybe someone at Nvidia can provide more information.

I’m almost always using RGBA, as it is simpler to use with OpenCV, and the underlying neural nets of nvinfer almost always expect RGB input.

Hi again,

Thanks to your suggestion, I have implemented a simpler plugin that directly processes RGBA.

But the problem is that the cv::cuda Gaussian filter is being applied MUCH more slowly! On a single image in a standalone script (outside the DeepStream framework), the same filter applied the same way runs 5 times faster!

I think it is related to the buffer layout, and that the pitch-linear layout is causing this. I would appreciate any input you have on this. I have created a topic here.
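For readers unfamiliar with the term, a pitch-linear image stores each row at a fixed byte stride (the pitch), which may be larger than width times bytes per pixel because of alignment padding. The sketch below is plain C++ with hypothetical names, and the 256-byte alignment in the example is only an illustrative value, not something taken from DeepStream. It does not establish the cause of the slowdown; it only shows how addressing works in such a buffer.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of pixel (x, y) in a pitch-linear image. pitch_bytes is the
// distance between the starts of consecutive rows and may exceed
// width * bytes_per_pixel.
std::size_t pitch_offset(std::size_t x, std::size_t y,
                         std::size_t pitch_bytes,
                         std::size_t bytes_per_pixel) {
    return y * pitch_bytes + x * bytes_per_pixel;
}

// Round a tightly packed row size up to a multiple of `alignment` bytes.
// The alignment value is an assumption chosen by the caller.
std::size_t aligned_pitch(std::size_t width, std::size_t bytes_per_pixel,
                          std::size_t alignment) {
    assert(alignment > 0);
    const std::size_t tight = width * bytes_per_pixel;
    return (tight + alignment - 1) / alignment * alignment;
}
```

With a 1000-pixel-wide RGBA row and a 256-byte alignment, the tight row is 4000 bytes but the pitch becomes 4096. One related thing worth checking: the cv::cuda::GpuMat constructor that wraps an existing pointer takes an optional step argument, and if the surface pitch differs from width * 4, that pitch (for example the pitch field of CUeglFrame) would presumably need to be passed as the step.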