Using VPI in a Custom DeepStream Plugin

• Hardware Platform (Jetson / GPU): Jetson Xavier
• DeepStream Version: 5.1
• Issue Type: Question

Hey there.
I am planning on implementing a custom DeepStream plugin that performs a perspective transformation on the frames in the buffer. After some research, I decided to use the VPI framework. I started from the gstdsexample plugin and took hints from posts such as this one and this one. Here is the essential logic of my plugin’s transform_ip function:

    ...
    NvBufSurfaceMap(input_buf, idx, 0, NVBUF_MAP_READ_WRITE);
    // NvBufSurfaceSyncForCpu (input_buf, 0, 0);

    guint src_width = GST_ROUND_UP_2 ((guint) vpiwarp->video_info.width);
    guint src_height = GST_ROUND_UP_2 ((guint) vpiwarp->video_info.height);

    VPIPerspectiveTransform h1={
            { 1.0, 0.0, 0.0 },
            { 0.0, 1.0, 100 },
            { 5, 1.2, 1.0 }
            };

    CHECK_VPI_STATUS(vpiCreatePerspectiveWarp(VPI_BACKEND_VIC, &vpiwarp->vic_warp));

    VPIImage img = NULL;

    VPIImageData img_data;
    memset(&img_data, 0, sizeof(img_data));

    img_data.format = VPI_IMAGE_FORMAT_RGBA8;
    img_data.numPlanes = input_buf->surfaceList[idx].planeParams.num_planes;

    img_data.planes[0].width = src_width; 
    img_data.planes[0].height = src_height;

    img_data.planes[0].pitchBytes = input_buf->surfaceList[idx].planeParams.pitch[0];
    img_data.planes[0].data = input_buf->surfaceList[idx].mappedAddr.addr[0];

    CHECK_VPI_STATUS(vpiImageCreateHostMemWrapper(&img_data, 0, &img));

    CHECK_VPI_STATUS(vpiSubmitPerspectiveWarp(vpiwarp->stream, VPI_BACKEND_VIC, vpiwarp->vic_warp, img, h1,
                    img, VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0));
    CHECK_VPI_STATUS(vpiStreamSync(vpiwarp->stream));

    NvBufSurfaceUnMap(input_buf, idx, 0);
    vpiImageDestroy(img);

    return GST_FLOW_OK;

Here is my pipeline:

gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_720p.h264 ! m.sink_0 \
nvstreammux name=m batch-size=1 width=1280 height=720 ! nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=RGBA" ! vpiwarp ! nvvideoconvert ! \
x264enc ! filesink location=out.h264

I have two main concerns:

  1. Let’s say the GPU will be busy with inference in my original application, and I want to use the VIC as the backend. What kind of wrapper should I use for the most efficient VIC usage?
    Should I use Host, CUDA (with EGL), or NvBuffer? Which one is fastest for using the VIC?

  2. Secondly, when I apply the transformation, I want it to replace the original frame in the buffer so that the downstream plugins see the modified frame. My current code doesn’t do that: the video file created by the sink element does not contain the transformed frames!
    One solution is of course to give up on the in-place plugin and use two buffers, but I would like to know whether I can avoid that (keep the in-place plugin and overwrite the frames in the buffer with the VPI output).

Thank you for spending time on this.
Best.

Hi,

1.
Please see the attached file in this comment.
We wrap the buffer with vpiImageCreateCUDAMemWrapper directly to avoid duplicate buffer mapping.

2.
Since perspective warp is not a point-to-point operation, you need to use different buffers to store input and output.

// Warp into a separate output image, then copy the result back into the wrapped input buffer
CHECK_VPI_STATUS(vpiSubmitPerspectiveWarp(dsexample->vpi_stream, 0, dsexample->warp, img, xform, out,
                                          VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0));
CHECK_VPI_STATUS(vpiSubmitConvertImageFormat(dsexample->vpi_stream, VPI_BACKEND_CUDA, out, img, NULL));
CHECK_VPI_STATUS(vpiStreamSync(dsexample->vpi_stream));
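
For reference, `out` here is a separate VPI image used as the warp destination. A minimal sketch of how it could be allocated once (this is not from the attached file; it assumes the 1280x720 RGBA format from your pipeline):

    /* Allocate a separate destination image once, matching the wrapped
       input's size and format, e.g. during the element's start/init. */
    VPIImage out = NULL;
    CHECK_VPI_STATUS(vpiImageCreate(1280, 720, VPI_IMAGE_FORMAT_RGBA8, 0, &out));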

Please check the attached file for the detailed implementation.

Thanks.


Dear AastaLLL,
Thank you for your help.
I took your advice and changed my code: I now use the CUDA wrapper and two buffers as you suggested, and it works <3. But the performance is still slower than I expected. According to the VPI benchmarks from here and here, warp_perspective should run at roughly 5000 FPS or more for the closest configuration using CUDA.

[Screenshot: VPI perspective warp benchmark table]

My input and output size is 1280x720, so even faster performance should be expected.
But my warp operation takes about 2 ms(!), roughly 2000 times slower than what it should be!

Here is how I measured the time:

    auto stop5 = std::chrono::high_resolution_clock::now();


    CHECK_VPI_STATUS(vpiSubmitPerspectiveWarp(vpiwarp->stream, VPI_BACKEND_CUDA,vpiwarp->vic_warp, vpiwarp->img,
          h1, vpiwarp->out, VPI_INTERP_NEAREST, VPI_BORDER_ZERO, 0));
    CHECK_VPI_STATUS(vpiStreamSync(vpiwarp->stream));

    auto stop6 = std::chrono::high_resolution_clock::now();
    std::cout << "stop6: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(stop6 - stop5).count()
                                   * 1E-9) <<std::endl;
    // this line prints values around 0.002 (seconds), i.e. ~2 ms
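
(For completeness, a variant of this measurement that skips the first, warm-up iteration and averages the rest could look like the sketch below; it uses only the calls already shown plus standard <chrono>.)

    // Variant of the measurement above: skip the first (warm-up) submission
    // and report a running average of the remaining warp times.
    static int n_measured = 0;
    static double total_s = 0.0;

    auto t0 = std::chrono::high_resolution_clock::now();
    CHECK_VPI_STATUS(vpiSubmitPerspectiveWarp(vpiwarp->stream, VPI_BACKEND_CUDA, vpiwarp->vic_warp, vpiwarp->img,
          h1, vpiwarp->out, VPI_INTERP_NEAREST, VPI_BORDER_ZERO, 0));
    CHECK_VPI_STATUS(vpiStreamSync(vpiwarp->stream));
    auto t1 = std::chrono::high_resolution_clock::now();

    if (n_measured++ > 0) {  // skip the first iteration
      total_s += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() * 1E-9;
      std::cout << "average warp time [s]: " << total_s / (n_measured - 1) << std::endl;
    }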

My pipeline is the same as in my first post on this topic.

I changed my code according to your suggestion:

  • in transform_ip I have:
...boilerplate stuff
  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next){

      warp_frame (vpiwarp, surface, i);
      i++;
  } 
...
  • in warp_frame I have:
static GstFlowReturn
warp_frame (VpiWarp * vpiwarp, NvBufSurface *input_buf, gint idx)
{

    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;

    static guint src_width = GST_ROUND_UP_2 ((guint) vpiwarp->video_info.width);
    static guint src_height = GST_ROUND_UP_2 ((guint) vpiwarp->video_info.height);

    VPIPerspectiveTransform h1={
            { 0.86, 1, 0 },
            { 1, -0.86, 0 }, 
            { 0, 0, 1 }};

    // Map the surface to an EGLImage and register it with CUDA
    if (NvBufSurfaceMapEglImage (input_buf, -1) != 0)
      return GST_FLOW_ERROR;

    status = cuGraphicsEGLRegisterImage (&pResource,
          input_buf->surfaceList[0].mappedAddr.eglImage,
          CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

    status = cuGraphicsResourceGetMappedEglFrame (&eglFrame, pResource, 0, 0);
    cuCtxSynchronize();

    // I added this condition to "create" the wrapper only once and "reset" it on subsequent uses;
    // I found out that this is faster than re-creating it on each iteration.
    if (vpiwarp->img == NULL){
      memset(&vpiwarp->img_data, 0, sizeof(vpiwarp->img_data));
      vpiwarp->img_data.format = VPI_IMAGE_FORMAT_RGBA8;
      vpiwarp->img_data.numPlanes = input_buf->surfaceList[idx].planeParams.num_planes;
      vpiwarp->img_data.planes[0].width = input_buf->surfaceList[idx].planeParams.width[0];
      vpiwarp->img_data.planes[0].height = input_buf->surfaceList[idx].planeParams.height[0];

      vpiwarp->img_data.planes[0].pitchBytes = input_buf->surfaceList[idx].planeParams.pitch[0];
      vpiwarp->img_data.planes[0].data = eglFrame.frame.pPitch[0];
      
      vpiImageCreateCUDAMemWrapper(&vpiwarp->img_data, 0, &vpiwarp->img);    

    }else{
      vpiwarp->img_data.planes[0].pitchBytes = input_buf->surfaceList[idx].planeParams.pitch[0];
      vpiwarp->img_data.planes[0].data = eglFrame.frame.pPitch[0];
      vpiImageSetWrappedCUDAMem(vpiwarp->img,&vpiwarp->img_data);

    }

    // Warp the wrapped input into the separate output image...
    vpiSubmitPerspectiveWarp(vpiwarp->stream, VPI_BACKEND_CUDA, vpiwarp->vic_warp, vpiwarp->img,
          h1, vpiwarp->out, VPI_INTERP_NEAREST, VPI_BORDER_ZERO, 0);
    vpiStreamSync(vpiwarp->stream);

    // ...then copy the result back into the wrapped input buffer so downstream elements see it
    vpiSubmitConvertImageFormat(vpiwarp->stream, VPI_BACKEND_CUDA, vpiwarp->out, vpiwarp->img, NULL);
    vpiStreamSync(vpiwarp->stream);

    cuGraphicsUnregisterResource(pResource);
    NvBufSurfaceUnMapEglImage (input_buf, -1);

    return GST_FLOW_OK;
  
}

I should probably also mention that there is another pipeline running on the device which uses nvinfer and PeopleNet. I know this can reduce performance significantly, but I think there is another problem with the code that makes it this slow. I currently can’t shut down the other pipeline due to a running demo; I will let you know if stopping it solves the issue.

May I ask whether you see any issues in my current code that could cause unnecessary latency?

If you don’t mind, I had one other quick question as well:

  1. Why not use the EGLImage wrapper from here instead of the CUDA wrapper? NvBufSurface already provides an EGLImage according to here. Wouldn’t that be faster than the CUDA wrapper?

Thank you very much for spending time on this.

Best.

Hi,

The benchmark value is ~0.18 ms while your result is ~2 ms, so it is ~11x slower rather than 2000x.

A ~10x slowdown is plausible given the different settings.
The score is measured following the instructions shared in the doc below:
https://docs.nvidia.com/vpi/algo_performance.html#benchmark

The main differences are that the benchmark maximizes device performance with a custom script and profiles the algorithm in batch mode.
This means it runs the warp on a batch of images (e.g. 16) at the same time, whereas in DeepStream, due to pipeline dependencies, you can only feed one image at a time, which accounts for the gap.
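
Conceptually (this is only an illustrative sketch, not from the attached files; `stream`, `warp`, `xform`, `in` and `out` are placeholders), the benchmark amortizes the launch and synchronization overhead across many submissions before a single sync:

    /* Illustrative sketch: submit a batch of warps, then synchronize once. */
    for (int i = 0; i < 16; ++i)
      CHECK_VPI_STATUS(vpiSubmitPerspectiveWarp(stream, VPI_BACKEND_CUDA, warp,
                                                in[i], xform, out[i],
                                                VPI_INTERP_NEAREST, VPI_BORDER_ZERO, 0));
    CHECK_VPI_STATUS(vpiStreamSync(stream));  /* one sync for the whole batch */

In the DeepStream pipeline, every frame pays the submit + sync overhead on its own.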

You can also give the EGLImage wrapper a try, but since the EGL <-> CUDA mapping is just a header-level mapping, the performance is expected to be similar.

Thanks.