Cuda blurring filter running too slow on gstdsexample using GpuMat!

Amin_Parchami · July 17, 2021, 10:58am

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson Xavier
• DeepStream Version 5.1
• Issue Type( questions, new requirements, bugs) question

I have turned the gstdsexample plugin into a plugin that blurs the input RGBA buffers using a cuda filter. The frames are extracted as cv::cuda::GpuMat. However, the FPS of the blurring section is too slow. Around 60 FPS for a 25x25 filter on a 1280x720 sample. This is really slower than it should be, since on a normal case, the blurring on a single frame runs on 300 FPS. I don’t know what am I missing here.

Here is how I extract the GpuMat mat and blur it:

static GstFlowReturn blur_frame (GpuBlurPure * gpublurpure, NvBufSurface *input_buf, gint idx){
    static guint src_width = GST_ROUND_UP_2((unsigned int)gpublurpure->video_info.width);
    static guint src_height = GST_ROUND_UP_2((unsigned int)(gint)gpublurpure->video_info.height);

    /* Prepare for getting the frame using egl. */
    NvBufSurfaceMapEglImage (input_buf, 0);

    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;
    cudaFree(0);

    /* The intermediate buffer has only one frame. Hence the index is 0 */
    status = cuGraphicsEGLRegisterImage(&pResource,
		input_buf->surfaceList[idx].mappedAddr.eglImage,
                CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
    status = cuCtxSynchronize();

    /* Get the GPU mat from intermediate buffer's eglframe */
    cv::cuda::GpuMat d_mat(src_height, src_width, CV_8UC4, eglFrame.frame.pPitch[0]);
 
    /* Process the Mat or make changes to it.*/
    auto single_start = std::chrono::high_resolution_clock::now();
    gpublurpure->filter->apply (d_mat, d_mat);
    auto single_stop = std::chrono::high_resolution_clock::now();
    // The time difference here is about 0.016 seconds! 

    status = cuCtxSynchronize();
    status = cuGraphicsUnregisterResource(pResource);

    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (input_buf, 0);

  auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(single_stop - single_start);
  float gpu_process_time = 1 / (duration.count() * 1E-9);
  std::cout<<"FPS: "<<gpu_process_time<<std::endl;
  std::cout<<"TIME: "<<duration.count() * 1E-9<<std::endl;
  return GST_FLOW_OK;
}

Here is how the blur_frame function is being called (in the transfrom_ip() function):


  batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);

  guint i = 0;
  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next){

      /* Blur the frame */
      blur_frame (gpublurpure, surface, i);
      i++;
  }

Here is a sample pipeline that works:

gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! \
m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=RGBA" ! gpublurpure ! nvvideoconvert ! x264enc ! filesink location=blurry.h264

Fiona.Chen · July 22, 2021, 6:39am

cv::cuda::GpuMat is not provided by Nvidia. We don’t know the performance of it.

What is the performance of the pipeline without gpublurpure plugin?

Amin_Parchami · July 22, 2021, 3:00pm

As you can see, I am not measuring the entire pipeline delay. I am just measuring this line. Applying a filter on GPU mat.

Amin_Parchami:

    auto single_start = std::chrono::high_resolution_clock::now();
    gpublurpure->filter->apply (d_mat, d_mat);
    auto single_stop = std::chrono::high_resolution_clock::now();

My question is that why it is about 5 times slower when I do this in a deepstream plugin.

Fiona.Chen · July 23, 2021, 2:31am

We don’t know what happened inside the code, it is implemented by you.

Amin_Parchami · July 24, 2021, 6:54am

The code is an exactly copy of gstdsexample.cpp in the SDK. The only thing that I have added is the GpuMat section where I read the frame from the buffer (with EglImage) and blur it. I have provided the code for this part in my original post. I implemented this part according to Nvidia’s suggestion here.

All I am asking is that why the blurring is being done so slow! It is supposed to be done in GPU (since we have a GpuMat). But, it is 5 times slower than it should be!

DaneLLL · July 27, 2021, 4:32am

Hi,
Please run sudo nvpmodel -m 0 and sudo jetson_clocks. The commands are to run GPU at max clock and should bring performance improvement.

Amin_Parchami · July 29, 2021, 7:32am

Thank you DaneLLL. I will definitely try these.
But on a normal code outside deepstream framework, where I read an image, upload it into GpuMat, and then blurred it, the blur part runs 5 times faster, compared to when I do the same blurring on GpuMat from EGL in deepstream.
I searched for any documentation of how exactly does the EGL part give the GpuMat. Does it, for instance, map a memory address for GPU?
Since this GpuMat extraction is the core of the applications I am about to develop, I really need to know where does this overhead from EGL come from.

Best.

DaneLLL · August 10, 2021, 4:22am

Hi,
Please check if you put blur_frame() in this for loop:

    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)

This is executed per detected object number. So one frame may be done for multiple times. If you put the function in the for loop, please move it out to do only one time for each frame.

Topic		Replies	Views
Gstdsexample plugin is slow: does GaussianBlur run on GPU? DeepStream SDK opencv , cuda , gstreamer	17	3556	October 12, 2021
Using Cuda filters on cuda::GpuMat obtained from NvBufSurface DeepStream SDK opencv , cuda , gstreamer	8	2450	July 27, 2021
cuGraphicsEGLRegisterImage, cuGraphicsResourceGetMappedEglFrame, NvBufSurfaceMapEglImage are getting slower after every call DeepStream SDK	3	43	July 24, 2024
Error generated while running the code after connecting the camera Jetson Xavier NX gstreamer , nvbugs	45	1261	January 2, 2024
Gstreamer CUDA Implementation Low FPS, cudaDeviceSynchronize Load Jetson Nano gstreamer	2	1039	October 15, 2021
DsExample plugin performance issue DeepStream SDK	4	643	October 28, 2023
Opencv gpu mat into GStreamer without downloading to cpu Jetson Nano opencv , gstreamer	19	8650	October 13, 2021
Camera tampering using dsexample in jetson nano DeepStream SDK	9	1060	December 28, 2021
How to create opencv gpumat from nvstream? DeepStream SDK	36	14634	July 27, 2021
Delay acumulation with gstreamer and opencv Jetson TX2 gstreamer	6	1389	October 18, 2021

Cuda blurring filter running too slow on gstdsexample using GpuMat!

Related topics