DsExample plugin performance issue

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) GPU
• DeepStream Version 6.2
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only) 525.125.06
• Issue Type (questions, new requirements, bugs)
I am using gst-dsexample as the last plugin in my DeepStream pipeline. Its purpose is as follows:

  1. Check if there were any OD detections in the frame
    a) If there were detections, transform the frame buffer using the NvBufSurfTransform function, convert the frame to an OpenCV matrix, and send the matrix and frame metadata (string) to redis.
    b) If there were no detections, only send the metadata to redis.

The frame transformation is causing low performance in this plugin. For now it transforms the full-resolution 2160x3840 frame; it only resizes the frame with the selected interpolation method. I have also tried setting the compute mode to NvBufSurfTransformCompute_GPU, but it had no effect.

I assume that if I don’t use the transformation, I won’t be able to put the frame into an OpenCV matrix and send it to redis. But it’s a really huge bottleneck - with the transformation I get around 5 FPS, without it about five times more.

Here is my get_converted_mat():

GstFlowReturn
get_converted_mat (GstDsExample * dsexample, NvBufSurface * input_buf, gint idx)
{
  NvBufSurfTransform_Error err;
  NvBufSurfTransformConfigParams transform_config_params;
  NvBufSurfTransformParams transform_params;
  NvBufSurface ip_surf;
  cv::Mat in_mat;
  ip_surf = *input_buf;

  ip_surf.numFilled = ip_surf.batchSize = 1;
  ip_surf.surfaceList = &(input_buf->surfaceList[idx]);

  /* Configure transform session parameters for the transformation */
  transform_config_params.compute_mode = NvBufSurfTransformCompute_Default;
  transform_config_params.gpu_id = dsexample->gpu_id;
  transform_config_params.cuda_stream = dsexample->cuda_stream;

  /* Set the transform session parameters for the conversions executed in this
   * thread. */
  err = NvBufSurfTransformSetSessionParams (&transform_config_params);
  if (err != NvBufSurfTransformError_Success) {
    GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
        ("NvBufSurfTransformSetSessionParams failed with error %d", err),
        (NULL));
    goto error;
  }

  /* Scale the whole frame with the default filter; no crop rectangles are set */
  transform_params.transform_flag = NVBUFSURF_TRANSFORM_FILTER;
  transform_params.transform_filter = NvBufSurfTransformInter_Default;

  /* Memset the memory */
  NvBufSurfaceMemSet (dsexample->inter_buf, 0, 0, 0);

  GST_DEBUG_OBJECT (dsexample, "Scaling and converting input buffer\n");

  /* Transformation: scaling + format conversion, if any */
  err = NvBufSurfTransform (&ip_surf, dsexample->inter_buf, &transform_params);
  if (err != NvBufSurfTransformError_Success) {
    GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
        ("NvBufSurfTransform failed with error %d while converting buffer", err),
        (NULL));
    goto error;
  }

  /* Map the buffer so that it can be accessed by the CPU */
  if (NvBufSurfaceMap (dsexample->inter_buf, 0, 0, NVBUF_MAP_READ) != 0) {
    goto error;
  }
  if (dsexample->inter_buf->memType == NVBUF_MEM_SURFACE_ARRAY) {
    /* Cache the mapped data for CPU access */
    NvBufSurfaceSyncForCpu (dsexample->inter_buf, 0, 0);
  }

  /* Use OpenCV to remove padding and convert RGBA to BGR. Can be skipped if
   * the algorithm can handle padded RGBA data. */
  in_mat =
      cv::Mat (dsexample->processing_height, dsexample->processing_width,
      CV_8UC4, dsexample->inter_buf->surfaceList[0].mappedAddr.addr[0],
      dsexample->inter_buf->surfaceList[0].pitch);

  cv::cvtColor (in_mat, *dsexample->cvmat, cv::COLOR_RGBA2BGR);

  if (NvBufSurfaceUnMap (dsexample->inter_buf, 0, 0)) {
    goto error;
  }

  if (dsexample->is_integrated) {
#ifdef __aarch64__
    /* To use the converted buffer in CUDA, create an EGLImage and then use
     * CUDA-EGL interop APIs */
    if (USE_EGLIMAGE) {
      if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) != 0) {
        goto error;
      }
      /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
       * Use interop APIs cuGraphicsEGLRegisterImage and
       * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */

      /* Destroy the EGLImage */
      NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
    }
#endif
  }

  return GST_FLOW_OK;

error:
  return GST_FLOW_ERROR;
}

Here is how I encode the image and send it to redis:

void write_frame_to_redis (GstDsExample * dsexample, std::string key)
{
  /* Raw BGR bytes of the converted frame; the consumer must know
   * width/height/channels to reassemble the image */
  int size = (int) dsexample->cvmat->total () * dsexample->cvmat->channels ();
  dsexample->redis_client.hset (key, "image",
      StringView ((char *) dsexample->cvmat->data, size));
}

Are there any ways to perform the transformation more efficiently? Is there any way to skip the transformation and just send the frame to redis immediately? I have tried skipping it, but the image parsed from redis contains all black pixels.

  1. You could also refer to DeepStream SDK FAQ - #12 by bcao to measure the latency of the pipeline components.
  2. If the dsexample plugin costs too much time, please check which part of the code is the bottleneck. NvBufSurfTransform should not cost much time because it is GPU-accelerated; please check the time consumed by the OpenCV processing.

I’ve been trying to optimize the plugin and find what causes the bottleneck. Following this forum post, Deepstream-opencv-test sample app is considerably slower than gst-launch pipeline, I changed create_params.memType = NVBUF_MEM_CUDA_PINNED; to create_params.memType = NVBUF_MEM_CUDA_UNIFIED;. I also changed nvbuf-memory-type to NVBUF_MEM_CUDA_PINNED in streammux, pgie and sgie, and in the sources I changed cudadec-memtype to 2 (memtype_unified). With a single memory type in the entire pipeline, processing a single source gave a 4x speedup, from 16 FPS to 64 FPS. However, increasing the number of sources seems to diminish this effect: with 6 sources, the speedup was only 2.5 FPS → 5 FPS.
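For reference, in a deepstream-app style config the memory-type knobs mentioned above live here. The enum values are my understanding of the dGPU mapping (nvbuf-memory-type: 3 = NVBUF_MEM_CUDA_UNIFIED; cudadec-memtype: 2 = unified for nvv4l2decoder) and should be double-checked against the docs for your DeepStream version:

```
# Keep one memory type end to end on dGPU
[streammux]
# 3 = NVBUF_MEM_CUDA_UNIFIED
nvbuf-memory-type=3

[primary-gie]
nvbuf-memory-type=3

[source0]
# 2 = memtype_unified for the decoder
cudadec-memtype=2
```

Mixing memory types forces extra copies between elements, which is consistent with the speedup observed once the whole pipeline was switched to a single type.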

  1. Do you have any recommendations for solving the bottleneck that dsexample causes? The speedup for one source was good, but it does not scale to many sources, and I need to use around 20 sources in total. In retail, it’s important to transfer the full frame to other applications for post-processing, to give some context around the ML detections.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
We suggest using test data to analyze this issue: for example, use this method to measure the latency of the pipeline components, and add logging in dsexample to check whether NvBufSurfTransform costs more time when more sources are used.
