Measurement of the Effects of Rivermax and GPUDirect Using nvdsudpsrc

Please provide complete information as applicable to your setup.

• Hardware Platform (GPU): A100X
• DeepStream Version: 6.1.1
• TensorRT Version: 8.4.1.5
• NVIDIA GPU Driver Version (valid for GPU only): 515.65.01
• Issue Type: questions

I want to measure the effect of Rivermax and GPUDirect
when using nvdsudpsrc to receive an H.264 stream.

I ran and compared the following two pipelines.
I used Nsight Systems as the measurement tool.

nsys profile --kill none -d 60 -o outputfile -f true
gst-launch-1.0 -e nvdsudpsrc address=192.168.0.2 local-iface-ip=192.168.0.2 port=8500 header-size=12 !
application/x-rtp, media=video, encoding-name=H264 !
queue !
rtph264depay !
h264parse !
nvv4l2decoder !
fakesink dump=false

nsys profile --kill none -d 60 -o outputfile -f true
gst-launch-1.0 -e udpsrc address=192.168.0.2 port=8500 !
application/x-rtp, media=video, encoding-name=H264 !
queue !
rtph264depay !
h264parse !
nvv4l2decoder !
fakesink dump=false

I looked at the Stats System View > GPU MemOps Summary (by Size) value in Nsight Systems.
There was no apparent difference between nvdsudpsrc and udpsrc.
I also used the top command to examine the CPU load average, but there was little difference there either.
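The same memcpy summary can also be pulled out of the report on the command line with `nsys stats`. This is only a sketch: the report name `cuda_gpu_mem_size_sum` is from newer Nsight Systems releases (older versions call it `gpumemsizesum`), and `outputfile.nsys-rep` stands in for the actual report file.

```shell
# Dump the GPU memory-operation summary as CSV and keep only the
# host-to-device rows (report name varies across nsys versions).
nsys stats --report cuda_gpu_mem_size_sum --format csv outputfile.nsys-rep \
  | awk -F',' '/HtoD/'
```

Comparing the HtoD rows of the two runs this way gives the same numbers as the GUI's GPU MemOps Summary (by Size) view.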

Therefore, the effect of Rivermax and GPUDirect is not apparent.

If I want to see the effect of Rivermax and GPUDirect,
is it correct to compare the GPU MemOps Summary (by Size) and the CPU load average?
Or can the effect of Rivermax and GPUDirect not be measured with Nsight at all?

If I can’t measure it with Nsight, is there another way or tool to check the effect of Rivermax and GPUDirect?

The effect of Rivermax and GPUDirect won’t be noticeable for compressed streams (H264/H265/VP8 etc.) because there are comparatively few packets per frame. But it will be quite significant if the stream is uncompressed, like YUV 4:2:2 10-bit 1080p or similar, because of the very high number of packets per frame.
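To put rough numbers on the packets-per-frame difference, here is a back-of-the-envelope sketch. The 1400-byte RTP payload and the 4 Mbps / 30 fps compressed stream are illustrative assumptions, not values from the pipelines above.

```shell
# Rough packets-per-frame comparison (assumed 1400-byte RTP payload).
# Uncompressed 1080p YCbCr 4:2:2 10-bit: 20 bits = 2.5 bytes per pixel.
frame_bytes=$((1920 * 1080 * 25 / 10))
echo "uncompressed packets/frame: $((frame_bytes / 1400))"
# Compressed stream at an assumed 4 Mbps, 30 fps:
echo "compressed packets/frame: $((4000000 / 8 / 30 / 1400))"
```

With these assumptions the uncompressed frame needs on the order of 3700 packets versus roughly a dozen for the compressed one, which is why the kernel-bypass path only pays off in the uncompressed case.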

Secondly, GPUDirect has no benefit for compressed streams because depacketization happens on the CPU, which requires a copy from GPU to system memory anyway.

I ran the following pipeline to process uncompressed data.

nsys profile --kill none -d 60 -o outputfile -f true
gst-launch-1.0 -e nvdsudpsrc address=192.168.x.x local-iface-ip=192.168.x.x port=8500 header-size=12 !
'application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)RAW, sampling=(string)YCbCr-4:2:2, depth=(string)10, width=(string)1920, height=(string)1080, colorimetry=(string)BT709, payload=(int)96' !
rtpvrawdepay ! nvvideoconvert !
m.sink_0 nvstreammux name=m width=1920 height=1080 batch-size=1 nvbuf-memory-type=2 !
nvinfer config-file-path=./infer_config.txt !
fakesink dump=false

nsys profile --kill none -d 60 -o outputfile -f true
gst-launch-1.0 -e udpsrc address=192.168.x.x port=8500 !
'application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)RAW, sampling=(string)YCbCr-4:2:2, depth=(string)10, width=(string)1920, height=(string)1080, colorimetry=(string)BT709, payload=(int)96' !
rtpvrawdepay ! nvvideoconvert !
m.sink_0 nvstreammux name=m width=1920 height=1080 batch-size=1 nvbuf-memory-type=2 !
nvinfer config-file-path=./infer_config.txt !
fakesink dump=false

I measured with Nsight, but I still don’t see the effect of nvdsudpsrc.
The CPU load measured by the top command is actually higher with nvdsudpsrc.

If nvdsudpsrc is used and data is copied directly from the NIC to the GPU,
I would expect the [Stats System View > GPU MemOps Summary (by Size) > CUDA memcpy HtoD] value to be low or zero.
Is that wrong?

Also, is the pipeline wrong?

The measurement results are as follows.
CUDA memcpy HtoD
nvdsudpsrc: 7.25 GiB
udpsrc: 6.96 GiB
Even with nvdsudpsrc, data appears to be moving to the GPU through host memory.

CPU utilization (average over 60 seconds)
nvdsudpsrc: 34.43%
udpsrc: 17.46%
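For reference, the 60-second CPU average can be collected non-interactively instead of reading top by hand. A sketch using top’s batch mode (the `%Cpu(s)` line format, with user time in field 2 and system time in field 4, is from common procps builds and may differ on other versions):

```shell
# Sample overall CPU usage once per second for 60 s and print the mean
# of user + system time ($2 = us, $4 = sy on common procps builds).
top -b -d 1 -n 60 \
  | awk '/^%Cpu/ {sum += $2 + $4; n++} END {printf "avg CPU: %.2f%%\n", sum / n}'
```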

Data is actually moved from CPU memory to GPU memory because open-source components are used in the pipeline. The OSS depayloader component works with CPU memory.