Nvvidconv transformation slow

• Hardware Platform: Jetson Orin
• DeepStream Version: 6.2
• JetPack Version: 5.1.2
• TensorRT Version: 8.5.2
• Issue Type: questions

Hi NVIDIA Team,

I have a question about the performance of the nvvidconv GStreamer plugin and its transformations, such as color conversion. I created two pipelines that convert NV12 to I420: one on the Jetson using memory:NVMM and one on a dGPU-based system using memory:CUDAMemory.

The NV12-to-I420 conversion takes up to 5 ms on the Jetson but only a fraction of that with cudaconvert on the dGPU-based system (see the statistics below; the converting elements are named colorconversion). I also called the low-level NvBufSurfTransform API directly to validate the gst-stats latency statistics. In addition, jetson_clocks is set to maximum and MAXN mode is active.

And here are my questions:

  1. Why is this taking so long? Are there any restrictions on the Jetson?
  2. Is there a way to speed up the color conversion, either with nvvidconv or by using the low-level API differently?
  3. Can we expect changes to nvvidconv or the low-level API in JetPack 6 that will improve this?

Pipeline Jetson (NVMM):
gst-launch-1.0 videotestsrc ! "video/x-raw,format=NV12,height=1080,width=1920" ! nvvidconv name=upload ! "video/x-raw(memory:NVMM),format=NV12" ! nvvidconv name=colorconversion ! "video/x-raw(memory:NVMM),format=I420" ! fakesink

Element Latency Statistics:
0xaaaac5f9a240.capsfilter0.src: mean=0:00:00.000037867 min=0:00:00.000031072 max=0:00:00.000064832
0xaaaac5f8e2b0.upload.src: mean=0:00:00.006018292 min=0:00:00.005887781 max=0:00:00.007127507
0xaaaac5f9a580.capsfilter1.src: mean=0:00:00.000067372 min=0:00:00.000053121 max=0:00:00.000130753
0xaaaac5f8edb0.colorconversion.src: mean=0:00:00.004943557 min=0:00:00.004919257 max=0:00:00.005199036
0xaaaac5f9a8c0.capsfilter2.src: mean=0:00:00.000056445 min=0:00:00.000044289 max=0:00:00.000113473

Pipeline dGPU (CUDAMemory):
gst-launch-1.0 videotestsrc ! "video/x-raw,format=NV12,height=1080,width=1920" ! cudaupload name=upload ! "video/x-raw(memory:CUDAMemory),format=NV12" ! cudaconvert name=colorconversion ! "video/x-raw(memory:CUDAMemory),format=I420" ! fakesink

Element Latency Statistics:
0x5683bf1f3390.capsfilter0.src: mean=0:00:00.000010096 min=0:00:00.000008035 max=0:00:00.000064963
0x5683bf5d07d0.upload.src: mean=0:00:00.000354512 min=0:00:00.000330104 max=0:00:00.005563382
0x5683bf5cfab0.capsfilter1.src: mean=0:00:00.000014520 min=0:00:00.000010550 max=0:00:00.000393354
0x5683bf139260.colorconversion.src: mean=0:00:00.000019263 min=0:00:00.000012815 max=0:00:00.000372935
0x5683bf9a4460.capsfilter2.src: mean=0:00:00.000012304 min=0:00:00.000008576 max=0:00:00.000050165
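
To put the two statistics blocks side by side, here is a minimal Python sketch that parses lines in the shape pasted above and computes the ratio of the mean latencies of the colorconversion elements. The helper names (to_ms, parse_stats, slowdown) are made up for illustration, and the line format is assumed to match what gst-stats-1.0 printed in this thread; other GStreamer versions may format it differently.

```python
import re

# Shape of one element-latency line as pasted in this thread (assumed format):
#   <address>.<element>.<pad>: mean=H:MM:SS.fraction min=... max=...
LINE_RE = re.compile(
    r"^0x[0-9a-fA-F]+\.(?P<element>[^:\s]+)\.(?P<pad>[^.:\s]+):\s+"
    r"mean=(?P<mean>\S+)\s+min=(?P<min>\S+)\s+max=(?P<max>\S+)\s*$"
)


def to_ms(stamp):
    """Convert an H:MM:SS.nnnnnnnnn timestamp to milliseconds."""
    hours, minutes, seconds = stamp.split(":")
    whole, _, frac = seconds.partition(".")
    ns = ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1_000_000_000
    ns += int(frac.ljust(9, "0")[:9])  # pad or cut the fraction to nanoseconds
    return ns / 1e6


def parse_stats(text):
    """Return {element: {"mean": ms, "min": ms, "max": ms}} for matching lines."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stats[match["element"]] = {
                key: to_ms(match[key]) for key in ("mean", "min", "max")
            }
    return stats


def slowdown(slow, fast, element):
    """Ratio of the mean latencies of one element in two parsed reports."""
    return slow[element]["mean"] / fast[element]["mean"]


# The two colorconversion lines copied from the statistics above.
JETSON_STATS = (
    "0xaaaac5f8edb0.colorconversion.src: mean=0:00:00.004943557 "
    "min=0:00:00.004919257 max=0:00:00.005199036"
)
DGPU_STATS = (
    "0x5683bf139260.colorconversion.src: mean=0:00:00.000019263 "
    "min=0:00:00.000012815 max=0:00:00.000372935"
)

if __name__ == "__main__":
    jetson = parse_stats(JETSON_STATS)
    dgpu = parse_stats(DGPU_STATS)
    print(f"Jetson mean: {jetson['colorconversion']['mean']:.3f} ms")
    print(f"dGPU mean:   {dgpu['colorconversion']['mean']:.3f} ms")
    print(f"ratio:       {slowdown(jetson, dgpu, 'colorconversion'):.1f}x")
```

The same functions can be pointed at the full gst-stats output of each run to compare the upload elements as well.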

Thank you for your help!

Best regards
Sven

May I know how you obtained these latency figures, so that I can check this on my JetPack 6.0 setup with DeepStream 6.4?

Hi,
Thanks for your quick reply. To measure the latency I’m using the GStreamer latency tracer.

Here is the pipeline for measuring:

GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency(flags=pipeline+element)" GST_DEBUG_FILE=trace.log gst-launch-1.0 videotestsrc ! "video/x-raw,format=NV12,height=1080,width=1920" ! nvvidconv name=upload ! "video/x-raw(memory:NVMM),format=NV12" ! nvvidconv name=colorconversion ! "video/x-raw(memory:NVMM),format=I420" ! fakesink

You can close the pipeline after 10 seconds or more and then call:

gst-stats-1.0 trace.log

The tracers and gst-stats should be available within the GStreamer package. I hope that GStreamer 1.20 on the Jetson is built with coretracers; otherwise the stats won’t be available. See the build options here: subprojects/gstreamer/meson_options.txt · 1.20 · GStreamer / gstreamer · GitLab
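
For repeated measurements, the two steps above can be scripted. This is only a sketch: the function names tracer_invocation and run_trace are made up for illustration, and it assumes gst-launch-1.0 and gst-stats-1.0 are on the PATH. The environment variables are exactly the ones from the command above.

```python
import os
import shlex
import subprocess


def tracer_invocation(pipeline, log_file="trace.log"):
    """Build the environment and argv for a latency-traced gst-launch-1.0 run."""
    env = dict(os.environ)
    env.update(
        {
            "GST_DEBUG": "GST_TRACER:7",
            "GST_TRACERS": "latency(flags=pipeline+element)",
            "GST_DEBUG_FILE": log_file,
        }
    )
    # shlex.split strips the shell quotes around the caps strings, which is
    # what the shell itself would do before handing them to gst-launch-1.0.
    argv = ["gst-launch-1.0"] + shlex.split(pipeline)
    return env, argv


def run_trace(pipeline, log_file="trace.log"):
    """Run the traced pipeline to completion, then return the gst-stats report."""
    env, argv = tracer_invocation(pipeline, log_file)
    subprocess.run(argv, env=env, check=True)
    report = subprocess.run(
        ["gst-stats-1.0", log_file], capture_output=True, text=True, check=True
    )
    return report.stdout
```

run_trace waits for the pipeline to finish on its own, so instead of stopping it by hand after 10 seconds the source can be limited, for example with videotestsrc num-buffers=300 (my assumption for scripting, not part of the command above).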

Thank you!

There are a few possible reasons:

  1. The dGPU itself performs better than the Jetson.
  2. nvvidconv has a buffer pool, so the latency you measure for a given frame includes the time spent in the buffer pool.
  3. Your pipeline can also be simplified as below:
gst-launch-1.0 videotestsrc ! "video/x-raw,format=NV12,height=1080,width=1920" ! nvvidconv name=colorconversion ! "video/x-raw(memory:NVMM),format=I420" ! fakesink

Thanks for your reply,

I don’t think the high latency comes from the buffer pool, because I also tried the low-level API that nvvidconv itself uses. Here is an example of how I think (after checking the nvvidconv source code) nvvidconv uses the API:

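    // surf is the source NvBufSurface (NV12), obtained elsewhere.
    NvBufSurface* dest = nullptr;

    // Zero-initialize so fields that are not set below do not hold garbage.
    NvBufSurfaceCreateParams createParams = {};
    createParams.gpuId        = surf->gpuId;
    createParams.colorFormat  = NVBUF_COLOR_FORMAT_YUV420;
    createParams.height       = surf->surfaceList->height;
    createParams.width        = surf->surfaceList->width;
    createParams.isContiguous = surf->isContiguous;
    createParams.layout       = surf->surfaceList->layout;
    createParams.memType      = surf->memType;

    // Allocate the destination outside the timed scope.
    int createStatus = NvBufSurfaceCreate(&dest, surf->batchSize, &createParams);
    // createStatus is 0 on success; handle a failure before transforming.

    {
        ZoneScopedN("ColorConversion");
        // Zero-initialized: no crop, flip, or filter flags are set.
        NvBufSurfTransformParams params = {};
        // Do the color conversion.
        NvBufSurfTransform_Error err = NvBufSurfTransform(surf, dest, &params);
        // err should equal NvBufSurfTransformError_Success.
    }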

ZoneScopedN comes from Tracy, the profiler I’m using to measure the latency. The color conversion inside the scope takes 5 ms, the same as the nvvidconv plugin. So back to my question: why does this take so long? A CUDA kernel that converts NV12 to I420 takes around 0.05 ms, which is a factor of 100. That does not seem plausible to me, even if the dGPU is faster than the Jetson. For applications where every millisecond matters, this approach is not really suitable. Or is my code wrong?

Another idea for speeding this up would be to write a GStreamer plugin that uses CUDA-EGL interop to get a CUDA pointer to the buffer data and runs a CUDA kernel for the color conversion. Is this a recommended or good idea?
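
For reference, whatever engine does the work (VIC, GPU, or a custom kernel), NV12 to I420 only copies the Y plane and splits the interleaved UV plane into separate U and V planes. Below is a small CPU sketch of that layout change in Python, which could serve as a reference when validating the output of a custom kernel. The function name is made up, and it assumes a tightly packed frame with no row padding; real NvBufSurface planes carry a pitch that may be larger than the width, so the rows would have to be unpadded first.

```python
def nv12_to_i420(frame, width, height):
    """Convert one tightly packed NV12 frame (bytes) to I420 (bytes).

    NV12: Y plane, then one plane of interleaved U,V samples.
    I420: Y plane, then the U plane, then the V plane.
    """
    if width % 2 or height % 2:
        raise ValueError("4:2:0 formats need an even width and height")
    luma_size = width * height
    chroma_size = luma_size // 2  # U and V together, subsampled 2x2
    if len(frame) != luma_size + chroma_size:
        raise ValueError("frame size does not match width * height * 3 / 2")

    y_plane = frame[:luma_size]
    uv_plane = frame[luma_size:]
    u_plane = uv_plane[0::2]  # even bytes of the interleaved plane are U
    v_plane = uv_plane[1::2]  # odd bytes are V
    return bytes(y_plane) + bytes(u_plane) + bytes(v_plane)


if __name__ == "__main__":
    # A 1920x1080 test frame: the conversion reads and writes this many bytes.
    frame = bytes(1920 * 1080 * 3 // 2)
    print(len(nv12_to_i420(frame, 1920, 1080)), "bytes per frame")
```

No pixel values are changed, only their order in memory, so the operation is dominated by memory traffic rather than arithmetic.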

If you want to make a comparison, the following three pipelines are more meaningful.
dGPU:

gst-launch-1.0 videotestsrc ! \
"video/x-raw,format=NV12,height=1080,width=1920" ! \
nvvideoconvert ! \
"video/x-raw(memory:NVMM),format=I420" ! fakesink

Jetson VIC:

gst-launch-1.0 videotestsrc ! \
"video/x-raw,format=NV12,height=1080,width=1920" ! \
nvvidconv ! \
"video/x-raw(memory:NVMM),format=I420" ! fakesink

Jetson GPU:

gst-launch-1.0 videotestsrc ! \
"video/x-raw,format=NV12,height=1080,width=1920" ! \
nvvidconv  compute-hw=1 ! \
"video/x-raw(memory:NVMM),format=I420" ! fakesink

You can also run export DBG_NVBUFSURFTRANSFORM=1 to check the per-frame latency in the low-level API.
It’s normal for the latency to rank Jetson VIC > Jetson GPU > dGPU.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks