CUDA blurring filter running too slow on gstdsexample using GpuMat!

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson Xavier
• DeepStream Version 5.1
• Issue Type( questions, new requirements, bugs) question

I have turned the gstdsexample plugin into a plugin that blurs the input RGBA buffers with a CUDA filter. The frames are extracted as cv::cuda::GpuMat. However, the blurring section is too slow: around 60 FPS for a 25x25 filter on a 1280x720 sample. This is much slower than it should be, since in a standalone program the same blur on a single frame runs at about 300 FPS. I don’t know what I am missing here.

Here is how I extract the GpuMat mat and blur it:

static GstFlowReturn blur_frame (GpuBlurPure * gpublurpure, NvBufSurface *input_buf, gint idx){
    static guint src_width = GST_ROUND_UP_2 ((unsigned int) gpublurpure->video_info.width);
    static guint src_height = GST_ROUND_UP_2 ((unsigned int) gpublurpure->video_info.height);

    /* Prepare for getting the frame using egl. */
    NvBufSurfaceMapEglImage (input_buf, 0);

    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;

    /* The intermediate buffer has only one frame. Hence the index is 0 */
    status = cuGraphicsEGLRegisterImage (&pResource,
        input_buf->surfaceList[idx].mappedAddr.eglImage,
        CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
    status = cuCtxSynchronize();

    /* Get the GPU mat from intermediate buffer's eglframe */
    cv::cuda::GpuMat d_mat(src_height, src_width, CV_8UC4, eglFrame.frame.pPitch[0]);
    /* Process the Mat or make changes to it.*/
    auto single_start = std::chrono::high_resolution_clock::now();
    gpublurpure->filter->apply (d_mat, d_mat);
    auto single_stop = std::chrono::high_resolution_clock::now();
    // The time difference here is about 0.016 seconds! 

    status = cuCtxSynchronize();
    status = cuGraphicsUnregisterResource(pResource);

    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (input_buf, 0);

    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds> (single_stop - single_start);
    float gpu_process_time = 1 / (duration.count () * 1E-9);
    std::cout << "FPS: " << gpu_process_time << std::endl;
    std::cout << "TIME: " << duration.count () * 1E-9 << std::endl;
    return GST_FLOW_OK;
}
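One thing worth double-checking is what the host-side timer actually measures: `std::chrono` brackets the call from the CPU's point of view, which mixes launch overhead, any implicit synchronization inside OpenCV, and the kernel itself. A sketch (untested on this setup) that times only the GPU portion with CUDA events, assuming `filter` is the same `cv::Ptr<cv::cuda::Filter>` used above and `d_mat` the mapped GpuMat:

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafilters.hpp>

/* Time only the GPU work of one blur, in milliseconds, using CUDA events
 * instead of host clocks. */
static double
time_blur_ms (cv::Ptr<cv::cuda::Filter> filter, cv::cuda::GpuMat &d_mat)
{
  cv::cuda::Stream stream;
  cv::cuda::Event start, stop;

  start.record (stream);
  filter->apply (d_mat, d_mat, stream);  /* enqueue the blur on our stream */
  stop.record (stream);
  stop.waitForCompletion ();             /* block until the blur finishes  */

  return cv::cuda::Event::elapsedTime (start, stop);
}
```

If the event time is much smaller than the `std::chrono` time, the overhead is in launch/synchronization rather than the blur kernel itself.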

Here is how the blur_frame function is being called (in the transform_ip() function):

  batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);

  guint i = 0;
  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next) {

      /* Blur the frame */
      blur_frame (gpublurpure, surface, i);
      i++;
  }

Here is a sample pipeline that works:

gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! \
m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=RGBA" ! gpublurpure ! nvvideoconvert ! x264enc ! filesink location=blurry.h264

cv::cuda::GpuMat is not provided by Nvidia, so we don’t know its performance.

What is the performance of the pipeline without gpublurpure plugin?
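You can get a baseline by removing gpublurpure and terminating the pipeline in fpsdisplaysink, for example (same source and caps as your pipeline; fakesink avoids rendering overhead):

```shell
gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! \
m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=RGBA" ! fpsdisplaysink text-overlay=false video-sink=fakesink -v
```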

As you can see, I am not measuring the entire pipeline delay. I am only measuring this single line: applying the filter to the GpuMat.

My question is why it is about 5 times slower when I do this inside a DeepStream plugin.

We don’t know what happens inside the code; it is implemented by you.

The code is an exact copy of gstdsexample.cpp from the SDK. The only thing I have added is the GpuMat section, where I read the frame from the buffer (via EGLImage) and blur it. I provided the code for this part in my original post. I implemented it according to Nvidia’s suggestion here.

All I am asking is why the blurring is so slow! It is supposed to run on the GPU (since we have a GpuMat), but it is 5 times slower than it should be!

Please run sudo nvpmodel -m 0 and sudo jetson_clocks. These commands run the GPU at its maximum clock and should bring a performance improvement.
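To confirm the clocks are actually locked after running the commands, you can check with:

```shell
sudo nvpmodel -m 0          # select the maximum-performance power model
sudo jetson_clocks          # pin CPU/GPU/EMC clocks to their maximum
sudo jetson_clocks --show   # print current clock settings to verify
```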

Thank you DaneLLL. I will definitely try these.
But in a normal program outside the DeepStream framework, where I read an image, upload it into a GpuMat, and then blur it, the blur part runs 5 times faster than the same blur on the GpuMat obtained from EGL in DeepStream.
I searched for documentation on how exactly the EGL part produces the GpuMat. Does it, for instance, map a memory address for the GPU?
Since this GpuMat extraction is the core of the applications I am about to develop, I really need to know where this EGL overhead comes from.
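One thing I plan to try, in case it helps others: hoisting the per-frame map/register/unregister out of the hot path, since cuGraphicsEGLRegisterImage and NvBufSurfaceMapEglImage are not free. A sketch (untested, assuming the intermediate NvBufSurface and its resolution are fixed for the life of the pipeline, with src_width/src_height as in my blur_frame above):

```cpp
/* Register the EGL image once and cache the mapped frame, instead of
 * mapping/registering/unregistering on every buffer. */
static CUgraphicsResource pResource = NULL;
static CUeglFrame eglFrame;

if (pResource == NULL) {
  NvBufSurfaceMapEglImage (input_buf, 0);
  cuGraphicsEGLRegisterImage (&pResource,
      input_buf->surfaceList[0].mappedAddr.eglImage,
      CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
  cuGraphicsResourceGetMappedEglFrame (&eglFrame, pResource, 0, 0);
}

cv::cuda::GpuMat d_mat (src_height, src_width, CV_8UC4,
                        eglFrame.frame.pPitch[0]);
gpublurpure->filter->apply (d_mat, d_mat);

/* cuGraphicsUnregisterResource and NvBufSurfaceUnMapEglImage would then
 * run once at plugin teardown, not per frame. */
```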