Gstdsexample plugin is slow: does GaussianBlur run on GPU?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson AGX Xavier
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4

I am running the gstdsexample plugin with blur-objects=true. Once blur is turned on, the frame rate drops significantly. My question: does this imply the example only runs GaussianBlur on the CPU, not the GPU?

If yes, how can I modify the example code to run GaussianBlur on the GPU to speed up the frame rate?

Please advise. Thanks a lot.

Hi,
Please check the source code of dsexample in

/opt/nvidia/deepstream/deepstream-5.0/sources/gst-plugins/gst-dsexample/gstdsexample.cpp

It calls GaussianBlur(), which runs on the CPU:
https://docs.opencv.org/3.4/d4/d86/group__imgproc__filter.html#gaabe8c836e97159a9193fb0b11ac52cf1

You may use the script to enable the CUDA filter:


And refer to the sample:

to create a CUDA Gaussian filter:
https://docs.opencv.org/master/dc/d66/group__cudafilters.html#gaa4df286369114cfd4b144ae211f6a6c8

Hi Dane, thank you for the information. I followed the steps to install the CUDA filter, modified the get_converted_mat() code and the Makefile accordingly, and compiled the code. I got the following error message:

paul@agx:~/vidar/gst-dsexample$ make
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -c -o gstdsexample.o -fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include gstdsexample.cpp
In file included from gstdsexample.cpp:15:0:
gstdsexample.h:39:10: fatal error: opencv2/cudafilters.hpp: No such file or directory
#include <opencv2/cudafilters.hpp>
^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:79: recipe for target 'gstdsexample.o' failed
make: *** [gstdsexample.o] Error 1

I found cudafilters.hpp in my install folder (~/Documents/cuda_python/install/opencv_contrib-4.3.0/modules/cudafilters/include/opencv2/cudafilters.hpp). This is because I copied install_opencv4.3.0_Jetson.sh into my ~/Documents/cuda_python directory, ran mkdir install, and then ran:
./install_opencv4.3.0_Jetson.sh install
After 90 minutes, everything compiled without error on my AGX Xavier.
Did I specify the install folder incorrectly? Which installation folder should I specify?
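(For context on where the headers and libraries end up: with OpenCV, the install location is normally controlled by CMake's CMAKE_INSTALL_PREFIX at configure time. Whether and how install_opencv4.3.0_Jetson.sh sets it is an assumption here; a hedged sketch of the usual configure step:)

```shell
# Hypothetical configure step -- the default prefix is /usr/local, so
# after `make install` the headers land in /usr/local/include/opencv4.
cmake -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.3.0/modules \
      -D WITH_CUDA=ON ..
```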
I also notice that in install_opencv4.3.0_Jetson.sh, there are 3 lines that have been commented out:

#sudo make install
#echo 'export PYTHONPATH=$PYTHONPATH:'$PWD'/python_loader/' >> ~/.bashrc
#source ~/.bashrc

Is this correct? Or do we need to uncomment these 3 lines to install all the files to the proper directories?

Or am I missing something else?

Please advise. Thanks a lot for your help.

Hi,
Please execute sudo make install. This should install the header files and built libraries.

Thank you Dane. After I ran sudo make install, most of the problems were gone, but I still encounter the following error message when running make:

paul@agx:~/vidar/gst-dsexample$ make
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -c -o gstdsexample.o -fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include gstdsexample.cpp
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -o libnvdsgst_dsexample.so gstdsexample.o -shared -Wl,-no-undefined -L dsexample_lib -ldsexample -L/usr/local/cuda-10.2/lib64/ -lcudart -ldl -lnppc -lnppig -lnpps -lnppicc -lnppidei -L/opt/nvidia/deepstream/deepstream-5.0/lib/ -lnvdsgst_helper -lnvdsgst_meta -lnvds_meta -lnvbufsurface -lnvbufsurftransform -Wl,-rpath,/opt/nvidia/deepstream/deepstream-5.0/lib/ -L/usr/local/lib -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_videoio -lopencv_cudafilters -lgstvideo-1.0 -lgstbase-1.0 -lgstreamer-1.0 -lgobject-2.0 -lglib-2.0
gstdsexample.o: In function `get_converted_mat(_GstDsExample*, NvBufSurface*, int, _NvOSD_RectParams*, double&, int, int)':
gstdsexample.cpp:(.text+0x2188): undefined reference to `cuGraphicsEGLRegisterImage'
gstdsexample.cpp:(.text+0x21a0): undefined reference to `cuGraphicsResourceGetMappedEglFrame'
gstdsexample.cpp:(.text+0x21a8): undefined reference to `cuCtxSynchronize'
gstdsexample.cpp:(.text+0x22a4): undefined reference to `cuCtxSynchronize'
gstdsexample.cpp:(.text+0x22b0): undefined reference to `cuGraphicsUnregisterResource'
collect2: error: ld returned 1 exit status
Makefile:83: recipe for target 'libnvdsgst_dsexample.so' failed
make: *** [libnvdsgst_dsexample.so] Error 1

What else might I be missing?

My Makefile changes look like this:

ifeq ($(TARGET_DEVICE),aarch64)
	PKGS:= gstreamer-1.0 gstreamer-base-1.0 gstreamer-video-1.0 
	# Add opencv4 to CFLAGS and LIBS
	CFLAGS+= -I /usr/local/include/opencv4
	LIBS+=-L/usr/local/lib -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_videoio -lopencv_cudafilters
else
	PKGS:= gstreamer-1.0 gstreamer-base-1.0 gstreamer-video-1.0 opencv
endif

CFLAGS+=$(shell pkg-config --cflags $(PKGS))
LIBS+=$(shell pkg-config --libs $(PKGS))

My gstdsexample.cpp changes look like this:

#ifdef __aarch64__
  /* To use the converted buffer in CUDA, create an EGLImage and then use
   * CUDA-EGL interop APIs */
  if (USE_EGLIMAGE) {
    if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
      goto error;
    }

    /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
     * Use interop APIs cuGraphicsEGLRegisterImage and
     * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */
    #if 1
        static bool create_filter = true;
        static cv::Ptr< cv::cuda::Filter > filter;
        CUresult status;
        CUeglFrame eglFrame;
        CUgraphicsResource pResource = NULL;
        cudaFree(0);
        status = cuGraphicsEGLRegisterImage(&pResource,
            dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage,
            CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
        status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
        status = cuCtxSynchronize();
        if (create_filter) {
            filter = cv::cuda::createSobelFilter(CV_8UC4, CV_8UC4, 1, 0, 3, 1, cv::BORDER_DEFAULT);
            //filter = cv::cuda::createGaussianFilter(CV_8UC4, CV_8UC4, cv::Size(31,31), 0, 0, cv::BORDER_DEFAULT);
            create_filter = false;
        }
        cv::cuda::GpuMat d_mat(dsexample->processing_height, dsexample->processing_width, CV_8UC4, eglFrame.frame.pPitch[0]);
        filter->apply (d_mat, d_mat);
        status = cuCtxSynchronize();
        status = cuGraphicsUnregisterResource(pResource);

        // apply back to the original buffer
        transform_params.src_rect = &dst_rect;
        transform_params.dst_rect = &src_rect;
        NvBufSurfTransform (dsexample->inter_buf, &ip_surf, &transform_params);
    #endif

    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
  }
#endif

Am I making these changes correctly? (By the way, what does in your code mean?) Thank you again for your help.

Hi,
Looks like only -lcudart is in the make command. Please add -lcuda and give it a try.
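(For readers hitting the same link error: the cu*-prefixed symbols such as cuGraphicsEGLRegisterImage and cuCtxSynchronize belong to the CUDA driver API, which lives in libcuda.so; -lcudart only links the runtime API from libcudart.so. A sketch of the Makefile change, assuming the stock dsexample Makefile's LIBS variable; the exact line placement may differ in your Makefile:)

```makefile
# The CUDA runtime API (cudaSetDevice, cudaFree, ...) comes from
# libcudart; the driver API (cuGraphicsEGLRegisterImage,
# cuGraphicsResourceGetMappedEglFrame, cuCtxSynchronize, ...) needs
# libcuda as well.
LIBS+= -L/usr/local/cuda-10.2/lib64/ -lcudart -lcuda
```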

Thank you Dane. After adding -lcuda, the compilation completed without error!

After compiling successfully (a great step achieved, thanks), I tried a test run and found the speed is even slower than before (the original CPU gstdsexample.cpp version), and I got a lot of warning messages like this:

WARNING: from element /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0: A lot of buffers are being dropped.
Additional debug info:
gstbasesink.c(2902): gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0:
There may be a timestamping problem, or this computer is too slow.

Is this normal? What else needs to be done to fully enjoy the GPU speedup in this example? Please advise. Thanks.

P.S. this is my pipeline to run the dsexample:

gst-launch-1.0 filesrc location= ~/data/ar.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! nvinfer config-file-path= /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1/dstest1_pgie_config.txt ! nvvideoconvert ! dsexample full-frame=1 blur_objects=false ! nvdsosd ! nvegltransform ! nveglglessink

Hi,
You may execute sudo jetson_clocks to get maximum performance, and sudo tegrastats to profile system load and check where the bottleneck is.

Per your recommendation, I ran sudo jetson_clocks (CPU 2.3GHz, GPU 1.4GHz) and sudo tegrastats. The situation doesn't improve that much (I still see frame hiccups).
From the CPU and GPU load observations:
CPU avg 2765mW mem 2.9GB
GPU avg 2000mW mem 868MB

Interestingly, if I comment out the code change and revert to not performing the Gaussian filter in CUDA (i.e., changing #if 1 … #endif to #if 0 … #endif in your code), then the frame rate is much smoother:
CPU avg 2910mW mem 2.7GB
GPU avg 2028mW mem 808MB

I tried to put the same Gaussian filter into an nvivafilter implementation like here

Then I don't need to run sudo jetson_clocks and can use only the 30W ALL mode (CPU 1.2GHz, GPU 905MHz); the resulting pipeline is much, much smoother (this is what I would expect from running at GPU speed):
CPU avg 1275mW mem 2.2GB
GPU avg 1155mW mem 605MB

Question: does gstdsexample.cpp have a lot of overhead (e.g., unnecessary buffer copying back and forth between CPU memory and GPU memory) causing such slow performance (compared with the nvivafilter implementation)?

Thank you for your insights in advance.

Hi,
You may check the source code of gst_dsexample_transform_ip(). There is a for loop:

    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)

You probably don't need the loop in your case and can remove it. Since the default dsexample is for demonstration and reference, you would need to look at the code and customize it to fit your use case.

I keep seeing this warning message while running the pipeline:

0:00:17.140238050 21231 0x559b5cd4f0 WARN v4l2bufferpool gstv4l2bufferpool.c:1491:gst_v4l2_buffer_pool_dqbuf:<nvv4l2decoder0:pool:sink> v4l2 provided buffer that is too big for the memory it was writing into. v4l2 claims 64 bytes used but memory is only 0B. This is probably a driver bug.

Do you think this is the source of the “slowness”?

Hi,

No, it is harmless. Please refer to

Dane, thank you for your advice. Per your recommendation above, I reduced gst_dsexample_transform_ip to the bare minimum, just for experimenting with the GaussianBlur filter in CUDA, as below:

static GstFlowReturn
gst_dsexample_transform_ip (GstBaseTransform * btrans, GstBuffer * inbuf)
{
  GstDsExample *dsexample = GST_DSEXAMPLE (btrans);
  GstMapInfo in_map_info;
  GstFlowReturn flow_ret = GST_FLOW_ERROR;

  NvBufSurface *surface = NULL;

  dsexample->frame_num++;
  CHECK_CUDA_STATUS (cudaSetDevice (dsexample->gpu_id),
      "Unable to set cuda device");

  memset (&in_map_info, 0, sizeof (in_map_info));
  if (!gst_buffer_map (inbuf, &in_map_info, GST_MAP_READ)) {
    g_print ("Error: Failed to map gst buffer\n");
    goto error;
  }

  surface = (NvBufSurface *) in_map_info.data;

  if (CHECK_NVDS_MEMORY_AND_GPUID (dsexample, surface))
    goto error;

//////////////////////cuda filter experiment//////////////////////
#ifdef __aarch64__
  /* To use the converted buffer in CUDA, create an EGLImage and then use
   * CUDA-EGL interop APIs */
  if (USE_EGLIMAGE) {
    if (NvBufSurfaceMapEglImage (surface, 0) !=0 ) {
      goto error;
    }

    /* surface->surfaceList[0].mappedAddr.eglImage
     * Use interop APIs cuGraphicsEGLRegisterImage and
     * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */
    #if 1
        //static bool create_filter = true;
        //static cv::Ptr< cv::cuda::Filter > filter;
        CUresult status;
        CUeglFrame eglFrame;
        CUgraphicsResource pResource = NULL;
        cudaFree(0);
        status = cuGraphicsEGLRegisterImage(&pResource,
    		surface->surfaceList[0].mappedAddr.eglImage,
                    CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
        status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
        status = cuCtxSynchronize();

        cv::cuda::GpuMat d_mat(dsexample->processing_height, dsexample->processing_width, CV_8UC4, eglFrame.frame.pPitch[0]);

        filter->apply (d_mat, d_mat);

        status = cuCtxSynchronize();
        status = cuGraphicsUnregisterResource(pResource);

    #endif
    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (surface, 0);
  }
#endif

/////////////////////end of experiment////////////////////////////

  flow_ret = GST_FLOW_OK;

error:
  gst_buffer_unmap (inbuf, &in_map_info);
  return flow_ret;
}

I was able to “make” and “sudo make install” successfully. When I ran the pipeline, I observed a couple of things:

  1. It is now very fast. Even running in “30W ALL” mode, it never drops a frame any more => that's very good.
  2. However, the filter behaves oddly: it only filters (blurs) the top 1/4 of the frame; the bottom 3/4 of the frame is not filtered (not blurred).

Question: am I manipulating the “surface” (eglFrame) correctly? If not, how should this in-place transformation (inbuf -> filter -> inbuf, without copying) be done?

Thank you very much for your help again.

P.S. Housekeeping changes:

//create filter in gst_dsexample_start
static gboolean
gst_dsexample_start (GstBaseTransform * btrans)
{
....
    filter = cv::cuda::createGaussianFilter(CV_8UC4, CV_8UC4, cv::Size(31,31), 0, 0, cv::BORDER_DEFAULT);
....
}

and declare the filter variable in gstdsexample.h:
cv::Ptr<cv::cuda::Filter> filter;

Puzzle solved:

however the filter behaves oddly: it only filters (blurs) the top 1/4 of the frame; the bottom 3/4 of the frame is not filtered (not blurred).

Solution: in the pipeline, dsexample needs processing-width/processing-height specified; otherwise it will use the default resolution of 640x480, which explains why only the top 1/4-ish got filtered at 1920x1080 resolution.

The following pipeline corrects the issue and runs fast using the reduced gst_dsexample_transform_ip shown in the previous post:

gst-launch-1.0 --gst-debug-level=0 filesrc location= ~/data/ar.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! nvinfer config-file-path= /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1/dstest1_pgie_config.txt ! nvvideoconvert ! dsexample full-frame=1 processing-width=1920 processing-height=1080 ! nvdsosd ! nvegltransform ! nveglglessink


Hi Dane, thank you for your help. If I want to create the same development environment in Docker, do I need to run an equivalent of this install_opencv4.3.0_Jetson.sh script in the Docker environment? What would the equivalent script be, to enable me to develop the DS plugin with CUDA OpenCV? Thank you again for your help.

Hi ynjiun,

Please create a new topic for this question. Thanks