Gstdsexample plugin is slow: does GaussianBlur run on GPU?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson AGX Xavier
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only) 4.4

I am running the gstdsexample plugin with blur-objects=true. Once blur is turned on, the frame rate drops significantly. My question: does this imply the example only runs GaussianBlur on the CPU, not the GPU?

If yes, how can I modify the example code to run GaussianBlur on the GPU to speed up the frame rate?

Please advise. Thanks a lot.

Hi,
Please check the source code of dsexample in

/opt/nvidia/deepstream/deepstream-5.0/sources/gst-plugins/gst-dsexample/gstdsexample.cpp

It calls GaussianBlur(), which runs on the CPU:
https://docs.opencv.org/3.4/d4/d86/group__imgproc__filter.html#gaabe8c836e97159a9193fb0b11ac52cf1

You may use the script to enable the CUDA filter:


And refer to the sample:

to create a CUDA Gaussian filter:
https://docs.opencv.org/master/dc/d66/group__cudafilters.html#gaa4df286369114cfd4b144ae211f6a6c8

Hi Dane, thank you for the information. I followed the steps to install the CUDA filter, modified the get_converted_mat() code and the Makefile accordingly, and compiled the code. I got the following error message:

paul@agx:~/vidar/gst-dsexample$ make
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -c -o gstdsexample.o -fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include gstdsexample.cpp
In file included from gstdsexample.cpp:15:0:
gstdsexample.h:39:10: fatal error: opencv2/cudafilters.hpp: No such file or directory
#include <opencv2/cudafilters.hpp>
^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:79: recipe for target 'gstdsexample.o' failed
make: *** [gstdsexample.o] Error 1

I found cudafilters.hpp in my install folder (~/Documents/cuda_python/install/opencv_contrib-4.3.0/modules/cudafilters/include/opencv2/cudafilters.hpp). This is because I copied install_opencv4.3.0_Jetson.sh into my ~/Documents/cuda_python directory, ran mkdir install, and then ran:
./install_opencv4.3.0_Jetson.sh install
After 90 minutes, everything compiled without error on my AGX Xavier.
Did I specify the install folder incorrectly? Which installation folder should I specify?
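(For context on where the headers and libraries end up: with OpenCV, the install location is normally controlled by CMake's CMAKE_INSTALL_PREFIX at configure time. Whether and how install_opencv4.3.0_Jetson.sh sets it is an assumption here; a hedged sketch of the usual configure step:)

```shell
# Hypothetical configure step -- the default prefix is /usr/local, so
# after `make install` the headers land in /usr/local/include/opencv4.
cmake -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.3.0/modules \
      -D WITH_CUDA=ON ..
```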
I also notice that in install_opencv4.3.0_Jetson.sh, there are 3 lines that have been commented out:

#sudo make install
#echo 'export PYTHONPATH=$PYTHONPATH:'$PWD'/python_loader/' >> ~/.bashrc
#source ~/.bashrc

Is this correct? Or do we need to uncomment these 3 lines to install all the files to the proper directories?

Or am I missing something else?

Please advise. Thanks a lot for your help.

Hi,
Please execute sudo make install. This should install the header files and built libraries.

Thank you Dane. After I ran sudo make install, most of the problems were gone, but I still encounter the following error message when running make:

paul@agx:~/vidar/gst-dsexample$ make
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -c -o gstdsexample.o -fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include gstdsexample.cpp
-fPIC -DDS_VERSION="5.0.0" -I /usr/local/cuda-10.2/include -I /opt/nvidia/deepstream/deepstream-5.0/sources/includes -I /usr/local/include/opencv4 -pthread -I/usr/include/gstreamer-1.0 -I/usr/include/orc-0.4 -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/aarch64-linux-gnu/glib-2.0/include
g++ -o libnvdsgst_dsexample.so gstdsexample.o -shared -Wl,-no-undefined -L dsexample_lib -ldsexample -L/usr/local/cuda-10.2/lib64/ -lcudart -ldl -lnppc -lnppig -lnpps -lnppicc -lnppidei -L/opt/nvidia/deepstream/deepstream-5.0/lib/ -lnvdsgst_helper -lnvdsgst_meta -lnvds_meta -lnvbufsurface -lnvbufsurftransform -Wl,-rpath,/opt/nvidia/deepstream/deepstream-5.0/lib/ -L/usr/local/lib -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_videoio -lopencv_cudafilters -lgstvideo-1.0 -lgstbase-1.0 -lgstreamer-1.0 -lgobject-2.0 -lglib-2.0
gstdsexample.o: In function `get_converted_mat(_GstDsExample*, NvBufSurface*, int, _NvOSD_RectParams*, double&, int, int)':
gstdsexample.cpp:(.text+0x2188): undefined reference to `cuGraphicsEGLRegisterImage'
gstdsexample.cpp:(.text+0x21a0): undefined reference to `cuGraphicsResourceGetMappedEglFrame'
gstdsexample.cpp:(.text+0x21a8): undefined reference to `cuCtxSynchronize'
gstdsexample.cpp:(.text+0x22a4): undefined reference to `cuCtxSynchronize'
gstdsexample.cpp:(.text+0x22b0): undefined reference to `cuGraphicsUnregisterResource'
collect2: error: ld returned 1 exit status
Makefile:83: recipe for target 'libnvdsgst_dsexample.so' failed
make: *** [libnvdsgst_dsexample.so] Error 1

What else might I be missing?

My Makefile changes look like this:

ifeq ($(TARGET_DEVICE),aarch64)
	PKGS:= gstreamer-1.0 gstreamer-base-1.0 gstreamer-video-1.0 
	# Add opencv4 to CFLAGS and LIBS
	CFLAGS+= -I /usr/local/include/opencv4
	LIBS+=-L/usr/local/lib -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_videoio -lopencv_cudafilters
else
	PKGS:= gstreamer-1.0 gstreamer-base-1.0 gstreamer-video-1.0 opencv
endif

CFLAGS+=$(shell pkg-config --cflags $(PKGS))
LIBS+=$(shell pkg-config --libs $(PKGS))

My gstdsexample.cpp changes look like this:

#ifdef __aarch64__
  /* To use the converted buffer in CUDA, create an EGLImage and then use
   * CUDA-EGL interop APIs */
  if (USE_EGLIMAGE) {
    if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
      goto error;
    }

    /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
     * Use interop APIs cuGraphicsEGLRegisterImage and
     * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */
    #if 1
        static bool create_filter = true;
        static cv::Ptr< cv::cuda::Filter > filter;
        CUresult status;
        CUeglFrame eglFrame;
        CUgraphicsResource pResource = NULL;
        cudaFree(0);
        status = cuGraphicsEGLRegisterImage(&pResource,
            dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage,
            CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
        status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
        status = cuCtxSynchronize();
        if (create_filter) {
            filter = cv::cuda::createSobelFilter(CV_8UC4, CV_8UC4, 1, 0, 3, 1, cv::BORDER_DEFAULT);
            //filter = cv::cuda::createGaussianFilter(CV_8UC4, CV_8UC4, cv::Size(31,31), 0, 0, cv::BORDER_DEFAULT);
            create_filter = false;
        }
        cv::cuda::GpuMat d_mat(dsexample->processing_height, dsexample->processing_width, CV_8UC4, eglFrame.frame.pPitch[0]);
        filter->apply (d_mat, d_mat);
        status = cuCtxSynchronize();
        status = cuGraphicsUnregisterResource(pResource);

        // apply back to the original buffer
        transform_params.src_rect = &dst_rect;
        transform_params.dst_rect = &src_rect;
        NvBufSurfTransform (dsexample->inter_buf, &ip_surf, &transform_params);
    #endif

    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
  }
#endif

Am I making these changes correctly? (By the way, what does in your code mean?) Thank you again for your help.

Hi,
Looks like only -lcudart is in the make command. Please add -lcuda and give it a try.
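(For readers hitting the same link error: the cu*-prefixed symbols such as cuGraphicsEGLRegisterImage and cuCtxSynchronize belong to the CUDA driver API, which lives in libcuda.so; -lcudart only links the runtime API from libcudart.so. A sketch of the Makefile change, assuming the stock dsexample Makefile's LIBS variable; the exact line placement may differ in your Makefile:)

```makefile
# The CUDA runtime API (cudaSetDevice, cudaFree, ...) comes from
# libcudart; the driver API (cuGraphicsEGLRegisterImage,
# cuGraphicsResourceGetMappedEglFrame, cuCtxSynchronize, ...) needs
# libcuda as well.
LIBS+= -L/usr/local/cuda-10.2/lib64/ -lcudart -lcuda
```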

Thank you Dane. After adding -lcuda, the compilation completed without error!

After compiling successfully (a great step achieved, thanks), I tried a test run and found the speed is even slower than before (the original CPU gstdsexample.cpp version), and I got a lot of warning messages like this:

WARNING: from element /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0: A lot of buffers are being dropped.
Additional debug info:
gstbasesink.c(2902): gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0:
There may be a timestamping problem, or this computer is too slow.

Is this normal? What else needs to be done to fully enjoy the GPU speedup in this example? Please advise. Thanks.

P.S. this is my pipeline to run the dsexample:

gst-launch-1.0 filesrc location= ~/data/ar.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! nvinfer config-file-path= /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1/dstest1_pgie_config.txt ! nvvideoconvert ! dsexample full-frame=1 blur_objects=false ! nvdsosd ! nvegltransform ! nveglglessink

Hi,
You may execute sudo jetson_clocks to get maximum performance, and sudo tegrastats to profile system load and check where the bottleneck is.

Per your recommendation, I ran sudo jetson_clocks (CPU 2.3GHz, GPU 1.4GHz) and sudo tegrastats. The situation doesn't improve that much (I still see frame hiccups).
From the CPU and GPU load observations:
CPU avg 2765mW mem 2.9GB
GPU avg 2000mW mem 868MB

Interestingly, if I comment out the code change and revert to not performing the Gaussian filter in CUDA (i.e., changing #if 1 … #endif to #if 0 … #endif in your code), then the frame rate is much smoother:
CPU avg 2910mW mem 2.7GB
GPU avg 2028mW mem 808MB

I tried to put the same Gaussian filter into an nvivafilter implementation like here

Then I don't need to run sudo jetson_clocks and can use only the 30W ALL mode (CPU 1.2GHz, GPU 905MHz); the resulting pipeline is much, much smoother (this is what I would expect from running at GPU speed):
CPU avg 1275mW mem 2.2GB
GPU avg 1155mW mem 605MB

Question: does gstdsexample.cpp have a lot of overhead (e.g., unnecessary buffer copying back and forth between CPU memory and GPU memory) causing such slow performance (compared with the nvivafilter implementation)?

Thank you for your insights in advance.

Hi,
You may check the source code of gst_dsexample_transform_ip(). There is a for loop:

    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)

You probably don't need the loop in your case and can remove it. Since the default dsexample is for demonstration and reference, you would need to look at the code and customize it to fit your use case.

I keep seeing this warning message while running the pipeline:

0:00:17.140238050 21231 0x559b5cd4f0 WARN v4l2bufferpool gstv4l2bufferpool.c:1491:gst_v4l2_buffer_pool_dqbuf:<nvv4l2decoder0:pool:sink> v4l2 provided buffer that is too big for the memory it was writing into. v4l2 claims 64 bytes used but memory is only 0B. This is probably a driver bug.

Do you think this is the source of the “slowness”?

Hi,

No, it is harmless. Please refer to

Dane, thank you for your advice. Per your recommendation above, I reduced gst_dsexample_transform_ip to the bare minimum, just for experimenting with the GaussianBlur filter in CUDA, as below:

static GstFlowReturn
gst_dsexample_transform_ip (GstBaseTransform * btrans, GstBuffer * inbuf)
{
  GstDsExample *dsexample = GST_DSEXAMPLE (btrans);
  GstMapInfo in_map_info;
  GstFlowReturn flow_ret = GST_FLOW_ERROR;

  NvBufSurface *surface = NULL;

  dsexample->frame_num++;
  CHECK_CUDA_STATUS (cudaSetDevice (dsexample->gpu_id),
      "Unable to set cuda device");

  memset (&in_map_info, 0, sizeof (in_map_info));
  if (!gst_buffer_map (inbuf, &in_map_info, GST_MAP_READ)) {
    g_print ("Error: Failed to map gst buffer\n");
    goto error;
  }

  surface = (NvBufSurface *) in_map_info.data;

  if (CHECK_NVDS_MEMORY_AND_GPUID (dsexample, surface))
    goto error;

//////////////////////cuda filter experiment//////////////////////
#ifdef __aarch64__
  /* To use the converted buffer in CUDA, create an EGLImage and then use
   * CUDA-EGL interop APIs */
  if (USE_EGLIMAGE) {
    if (NvBufSurfaceMapEglImage (surface, 0) !=0 ) {
      goto error;
    }

    /* surface->surfaceList[0].mappedAddr.eglImage
     * Use interop APIs cuGraphicsEGLRegisterImage and
     * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */
    #if 1
        //static bool create_filter = true;
        //static cv::Ptr< cv::cuda::Filter > filter;
        CUresult status;
        CUeglFrame eglFrame;
        CUgraphicsResource pResource = NULL;
        cudaFree(0);
        status = cuGraphicsEGLRegisterImage(&pResource,
    		surface->surfaceList[0].mappedAddr.eglImage,
                    CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
        status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
        status = cuCtxSynchronize();

        cv::cuda::GpuMat d_mat(dsexample->processing_height, dsexample->processing_width, CV_8UC4, eglFrame.frame.pPitch[0]);

        filter->apply (d_mat, d_mat);

        status = cuCtxSynchronize();
        status = cuGraphicsUnregisterResource(pResource);

    #endif
    /* Destroy the EGLImage */
    NvBufSurfaceUnMapEglImage (surface, 0);
  }
#endif

/////////////////////end of experiment////////////////////////////

  flow_ret = GST_FLOW_OK;

error:
  gst_buffer_unmap (inbuf, &in_map_info);
  return flow_ret;
}

I was able to “make” and “sudo make install” successfully. When I ran the pipeline, I observed a couple of things:

  1. It is now very fast. Even running in “30W ALL” mode, it never drops a frame any more => that's very good.
  2. However, the filter behaves oddly: it only filters (blurs) the top 1/4 of the frame; the bottom 3/4 of the frame is not filtered (not blurred).

Question: am I manipulating the “surface” (eglFrame) correctly? If not, how should this in-place transformation (inbuf -> filter -> inbuf, without copying) be done?

Thank you very much for your help again.

P.S. Housekeeping changes:

//create filter in gst_dsexample_start
static gboolean
gst_dsexample_start (GstBaseTransform * btrans)
{
....
    filter = cv::cuda::createGaussianFilter(CV_8UC4, CV_8UC4, cv::Size(31,31), 0, 0, cv::BORDER_DEFAULT);
....
}

and declare the filter variable in gstdsexample.h:
cv::Ptr<cv::cuda::Filter> filter;

Puzzle solved:

however the filter behaves oddly: it only filters (blurs) the top 1/4 of the frame; the bottom 3/4 of the frame is not filtered (not blurred).

Solution: in the pipeline, dsexample needs processing-width/processing-height specified; otherwise it will use the default resolution of 640x480, which explains why only the top 1/4-ish got filtered at 1920x1080 resolution.

The following pipeline corrects the issue and runs fast using the reduced gst_dsexample_transform_ip shown in the previous post:

gst-launch-1.0 --gst-debug-level=0 filesrc location= ~/data/ar.h264 ! h264parse ! nvv4l2decoder ! m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! nvinfer config-file-path= /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1/dstest1_pgie_config.txt ! nvvideoconvert ! dsexample full-frame=1 processing-width=1920 processing-height=1080 ! nvdsosd ! nvegltransform ! nveglglessink


Hi Dane, thank you for your help. If I want to create the same development environment in Docker, do I need to run an equivalent of this install_opencv4.3.0_Jetson.sh script in the Docker environment? What would the equivalent script be, to enable me to develop the DS plugin with CUDA OpenCV? Thank you again for your help.

Hi ynjiun,

Please create a new topic for this question. Thanks