The Effects of Rivermax and GPUDirect Using nvdsudpsrc

• Hardware Platform (GPU): A100X
• DeepStream Version: 6.1.1
• TensorRT Version: 8.4.1.5
• NVIDIA GPU Driver Version (valid for GPU only): 515.65.01
• Issue Type: questions

I would like to ask a follow-up question on a matter I asked about before.

In my previous question, I was told that GPUDirect has little effect on compressed data, so this time we processed uncompressed data.
However, we still do not see the benefit of GPUDirect even with uncompressed data.

We ran the following pipeline:
gst-launch-1.0 -e nvdsudpsrc address=xxx.xxx.x.x local-iface-ip=xxx.xxx.x.x port=8500 header-size=12 ! \
'application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)RAW, sampling=(string)YCbCr-4:2:2, depth=(string)10, width=(string)1920, height=(string)1080, colorimetry=(string)BT709, payload=(int)96' ! \
rtpvrawdepay ! nvvideoconvert ! \
m.sink_0 nvstreammux name=m width=1920 height=1080 batch-size=1 nvbuf-memory-type=2 ! \
nvinfer config-file-path=./config.txt ! \
fakesink dump=false

Because rtpvrawdepay is included, we assume that the uncompressed data is copied into system memory and the benefit of GPUDirect is not obtained.
Is this correct?

If so, what pipeline would you write so that Rivermax works without copying the data into system memory?

Yes.

No. The rtpvrawdepay element is needed for the RTP protocol. It is not recommended to transfer raw video data over a network protocol.

Thank you for your reply.

So how do you write a pipeline that uses nvdsudpsrc to benefit from GPUDirect?

I think the original description is: "Rivermax's GPUDirect utilizes the high-speed PCIe interface to pass the data directly to and from the GPU without burdening the CPU cores" (NVIDIA Rivermax: Optimized Networking SDK for Data | NVIDIA Developer). The implementation is only at the UDP stack level; an RTP-level stack is not implemented. So unavoidably the RTP payload is copied from GPU memory to CPU memory to be handled by the open-source RTP stack.

It is better to transfer compressed video data over the network (LAN, WAN, …); there is a HW video decoder (decompressor) to accelerate the whole pipeline in GPU memory.
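
For example, a receive pipeline for a compressed H264 RTP stream could look like the sketch below (the addresses, port, payload type, and config path are placeholders taken from your pipeline; rtph264depay and nvv4l2decoder are the standard RTP depayloader and HW decoder elements):

gst-launch-1.0 -e nvdsudpsrc address=xxx.xxx.x.x local-iface-ip=xxx.xxx.x.x port=8500 header-size=12 ! \
'application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)H264, payload=(int)96' ! \
rtph264depay ! h264parse ! nvv4l2decoder ! \
m.sink_0 nvstreammux name=m width=1920 height=1080 batch-size=1 ! \
nvinfer config-file-path=./config.txt ! \
fakesink

Here the decoded frames leave nvv4l2decoder already in GPU (NVMM) memory, so nvstreammux and nvinfer do not need an extra raw-frame copy through system memory.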

I have three questions.

The following nvdsudpsrc description states that RTP header and payload separation is supported:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvdsudpsrc.html

According to this, rtpvrawdepay and rtph264depay do not seem necessary.
Why, then, are rtpvrawdepay and rtph264depay necessary?

In your previous answer, you said the following:
"So unavoidably the RTP payload is copied from GPU memory to CPU memory to be handled by the open-source RTP stack."

According to this, the data flow inside nvdsudpsrc seems to be the following:
・Data flow inside nvdsudpsrc:
(network) --> CX6 --(RTP payload)--> GPU --(RTP payload)--> CPU

In other words, we understand that the RTP payload is sent from the network to the GPU and then to the CPU. Why is it transferred to the GPU instead of directly to the CPU?

How is the data sent to the GPU by nvdsudpsrc supposed to be used?

Any UDP source only gets the RTP payload out. The RTP depay element parses the RTP payload into meaningful frames according to the format (H264, RGB, YCbCr-4:4:4, AC3, …).
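
As a generic illustration of the depayloader's role (a sketch with a placeholder port, using the plain open-source udpsrc rather than nvdsudpsrc):

gst-launch-1.0 udpsrc port=5000 ! \
'application/x-rtp, media=(string)video, clock-rate=(int)90000, encoding-name=(string)H264, payload=(int)96' ! \
rtph264depay ! h264parse ! fakesink dump=false

udpsrc only pushes RTP packets downstream; it is rtph264depay that strips the RTP headers and reassembles the payloads into H264 access units that a parser or decoder can consume.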

No, it is not "sent" from the GPU to the CPU. The memory can be accessed by both the GPU and the CPU; it is page-locked memory on the host (see CUDA Runtime API :: CUDA Toolkit Documentation, nvidia.com). The copy is done by HW, not by the CPU.

Page-locked memory on the host is used. It is already optimized. The key point is that you are using the YCbCr-4:2:2 payload format; it will impact the overall performance even with GPU acceleration.

Please tell me about the data and process flow.

・When nvdsudpsrc runs, the separated RTP payload is stored in page-locked host memory.
・When rtpvrawdepay runs, it reads data from page-locked host memory and writes the processed data back to page-locked host memory.
・Then, when nvinfer runs, the data is copied from page-locked host memory to GPU memory and processed.
Is this correct?

Yes. The memory looks the same as other system memory, which can be read/written by the CPU.

No. The output frames are in ordinary system memory. This is why we ask you to use compressed video data (H264, H265, VP9, …) instead of raw video data (RGB, YUV, …).

nvvideoconvert will copy the frames into CUDA memory, so nvstreammux and nvinfer work on CUDA memory.
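
For example, the copy into CUDA memory can be made explicit with NVMM caps on the nvvideoconvert output. Below is a self-contained sketch that uses videotestsrc in place of the network source; the NV12 format is an assumption (a format nvinfer commonly consumes), and config.txt is the same config file as in your pipeline:

gst-launch-1.0 videotestsrc num-buffers=100 ! \
'video/x-raw, width=1920, height=1080' ! \
nvvideoconvert ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! \
m.sink_0 nvstreammux name=m width=1920 height=1080 batch-size=1 ! \
nvinfer config-file-path=./config.txt ! fakesink

Everything downstream of nvvideoconvert then stays in CUDA/NVMM memory.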

This page (Gst-nvdsudpsrc — DeepStream 6.3 Release documentation) says the following:
"The payload can be copied directly into GPU (pinned) memory, but the header is always in system memory."

Therefore, the RTP payload is copied into both page-locked system memory and GPU memory.
Is this correct?

In that case, the RTP payload copied into GPU memory appears to go unused.

Are there any plans to implement an RTP-level stack in the future?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

The page-locked system memory can be accessed by both the CPU and the GPU.

There is no extra copy.

No.
