CUDA blurring filter running too slowly in gstdsexample using GpuMat!

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson Xavier
• DeepStream Version 5.1
• Issue Type( questions, new requirements, bugs) question

I have turned the gstdsexample plugin into a plugin that blurs the input RGBA buffers using a CUDA filter. The frames are extracted as cv::cuda::GpuMat. However, the FPS of the blurring section is too low: around 60 FPS for a 25x25 filter on a 1280x720 sample. This is much slower than it should be; in a standalone test, blurring a single frame runs at about 300 FPS. I don’t know what I am missing here.

Here is how I extract the GpuMat mat and blur it:

static GstFlowReturn blur_frame (GpuBlurPure * gpublurpure, NvBufSurface *input_buf, gint idx){
    static guint src_width = GST_ROUND_UP_2 ((guint) gpublurpure->video_info.width);
    static guint src_height = GST_ROUND_UP_2 ((guint) gpublurpure->video_info.height);

    /* Prepare for getting the frame using egl. */
    NvBufSurfaceMapEglImage (input_buf, 0);

    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;

    /* The intermediate buffer has only one frame. Hence the index is 0 */
    status = cuGraphicsEGLRegisterImage(&pResource,
        input_buf->surfaceList[0].mappedAddr.eglImage,
        CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
    status = cuCtxSynchronize();

    /* Get the GPU mat from intermediate buffer's eglframe */
    cv::cuda::GpuMat d_mat(src_height, src_width, CV_8UC4, eglFrame.frame.pPitch[0]);
    /* Process the Mat or make changes to it.*/
    auto single_start = std::chrono::high_resolution_clock::now();
    gpublurpure->filter->apply (d_mat, d_mat);
    auto single_stop = std::chrono::high_resolution_clock::now();
    // The time difference here is about 0.016 seconds! 

    status = cuCtxSynchronize();
    status = cuGraphicsUnregisterResource(pResource);

    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (input_buf, 0);

    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(single_stop - single_start);
    float gpu_process_time = 1 / (duration.count() * 1E-9);
    std::cout << "FPS: " << gpu_process_time << std::endl;
    std::cout << "TIME: " << duration.count() * 1E-9 << std::endl;
    return GST_FLOW_OK;
}

Here is how the blur_frame function is being called (in the transform_ip() function):

  batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);

  guint i = 0;
  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next){

      /* Blur the frame */
      blur_frame (gpublurpure, surface, i);
      i++;
  }

Here is a sample pipeline that works:

gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! \
m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=RGBA" ! gpublurpure ! nvvideoconvert ! x264enc ! filesink location=blurry.h264

cv::cuda::GpuMat is not provided by Nvidia, so we don’t know its performance.

What is the performance of the pipeline without gpublurpure plugin?
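One way to get that baseline is to drop gpublurpure from the pipeline and let fpsdisplaysink report the throughput. Below is a sketch of such a command, assuming the stock DeepStream 5.1 sample stream and the standard fpsdisplaysink/fakesink elements; it is untested on this exact setup:

```shell
# Measure pipeline throughput without the blur plugin:
# fpsdisplaysink prints the FPS, fakesink discards the frames.
gst-launch-1.0 -v uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
  "video/x-raw(memory:NVMM),format=RGBA" ! nvvideoconvert ! \
  fpsdisplaysink text-overlay=false video-sink=fakesink sync=false
```

Comparing this number against the same pipeline with gpublurpure inserted isolates the cost of the plugin itself.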

As you can see, I am not measuring the entire pipeline delay. I am only measuring this one line: applying a filter to the GpuMat.

My question is why it is about 5 times slower when I do this in a DeepStream plugin.

We don’t know what happens inside the code; it is implemented by you.

The code is an exact copy of gstdsexample.cpp in the SDK. The only thing I have added is the GpuMat section, where I read the frame from the buffer (via EGLImage) and blur it. I have provided the code for this part in my original post. I implemented it according to Nvidia’s suggestion here.

All I am asking is why the blurring is so slow! It is supposed to run on the GPU (since we have a GpuMat), but it is 5 times slower than it should be!

Please run sudo nvpmodel -m 0 and sudo jetson_clocks. These commands run the GPU at its maximum clock and should improve performance.

Thank you DaneLLL. I will definitely try these.
But in normal code outside the DeepStream framework, where I read an image, upload it into a GpuMat, and then blur it, the blur part runs 5 times faster than the same blur on a GpuMat obtained from EGL in DeepStream.
I searched for documentation on how exactly the EGL part produces the GpuMat. Does it, for instance, map a memory address for the GPU?
Since this GpuMat extraction is the core of the applications I am about to develop, I really need to know where this EGL overhead comes from.


Please check if you put blur_frame() in this for loop:

    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)

This is executed once per detected object, so one frame may be processed multiple times. If you put the function inside the for loop, please move it out so it is done only once per frame.