I would like to ask a new question about a matter I asked you about before.
In response to my previous question, I heard that GPUDirect has little effect on compressed data, so we performed the processing with uncompressed data.
However, we do not see the benefit of GPUDirect even with uncompressed data.
Because rtpvrawdepay is included in the pipeline, we assume that the uncompressed data is copied into system memory and the benefit of GPUDirect is lost.
Is this correct?
If so, what pipeline would you write to make Rivermax work without copying the data into system memory?
I think the original description is "Rivermax's GPUDirect utilizes the high speed PCIe interface to pass the data directly to and from the GPU without burdening the CPU cores" (NVIDIA Rivermax: Optimized Networking SDK for Data | NVIDIA Developer), but the implementation is only at the UDP stack level. An RTP-level stack is not implemented. So unavoidably the RTP payload is copied from GPU memory to CPU memory to be handled with the open-source RTP stack.
It is better to transfer compressed video data over the network (LAN, WAN, …); there is a hardware video decoder to decompress it and accelerate the whole pipeline in GPU memory.
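For reference, here is a minimal sketch of the kind of compressed-data pipeline this suggests, written with the Python GStreamer bindings. The IP addresses, port, nvinfer config path, and the nvdsudpsrc property names are assumptions for illustration; please verify them with gst-inspect-1.0 nvdsudpsrc on your setup.

```python
#!/usr/bin/env python3
# Sketch: receive compressed H.264 RTP with nvdsudpsrc, decode on the GPU,
# and keep the decoded frames in CUDA/NVMM memory for nvinfer.
# The addresses, port, and config path below are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

pipeline = Gst.parse_launch(
    # nvstreammux batches NVMM buffers, so nvinfer works on CUDA memory
    "nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=config_infer.txt ! "
    "nvvideoconvert ! nvdsosd ! fakesink "
    # nvdsudpsrc is the Rivermax-backed source; local-iface-ip selects the
    # ConnectX NIC (property names assumed from the DeepStream docs)
    "nvdsudpsrc local-iface-ip=192.168.1.10 address=239.1.1.1 port=5000 ! "
    "application/x-rtp,media=video,encoding-name=H264,clock-rate=90000 ! "
    # depayloading compressed data on the CPU is cheap; the heavy work,
    # decoding, happens on the GPU in nvv4l2decoder
    "rtph264depay ! h264parse ! nvv4l2decoder ! m.sink_0"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```

Because only the small compressed payload passes through system memory here, the per-frame copy that rtpvrawdepay forces on raw video never happens.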
According to this, rtpvrawdepay and rtph264depay do not seem necessary.
Why are rtpvrawdepay and rtph264depay necessary?
In your previous answer, you said the following:
"So unavoidably the RTP payload is copied from GPU memory to CPU memory to be handled with the open-source RTP stack."
According to this, the data flow inside nvdsudpsrc seems to be the following:
・Data flow inside nvdsudpsrc:
network --------> CX6 --(rtp payload)--> GPU --(rtp payload)--> CPU
In other words, we understand that it sends the RTP payload from the network to the GPU and then to the CPU. Why does it transfer the data to the GPU instead of directly to the CPU?
How is the data sent to the GPU by nvdsudpsrc supposed to be used?
Any UDP source only gets the RTP payload out. The RTP depay element parses the RTP payload into meaningful frames according to the format (H264, RGB, YCbCr-4:4:4, AC3, …).
Page-locked memory on the host is used, and it is already optimized. The key point is that you are using a YCbCr-4:2:2 format payload, which impacts the whole performance even when there is GPU acceleration.
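To make the depayloader's role concrete, here is a hedged sketch of the raw-video branch under discussion: rtpvrawdepay can only reassemble complete frames because the RTP caps spell out the sampling, depth, width, and height (per RFC 4175). The addresses and caps values are placeholders for illustration.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Sketch of the uncompressed branch: rtpvrawdepay runs on the CPU against
# page-locked host memory, turning RTP packets back into full video frames.
raw_branch = Gst.parse_launch(
    "nvdsudpsrc local-iface-ip=192.168.1.10 address=239.1.1.2 port=5004 ! "
    # without these caps fields the depayloader cannot interpret the payload
    "application/x-rtp,media=video,clock-rate=90000,encoding-name=RAW,"
    "sampling=(string)YCbCr-4:2:2,depth=(string)8,"
    "width=(string)1920,height=(string)1080 ! "
    "rtpvrawdepay ! "                               # CPU: packets -> frames
    "nvvideoconvert ! video/x-raw(memory:NVMM) ! "  # copy into CUDA memory
    "fakesink"
)
```

Note that a YCbCr-4:2:2 frame at 1080p is roughly 4 MB, and every frame is assembled on the CPU and then copied across PCIe, which is where the performance cost comes from.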
・When nvdsudpsrc is executed, the separated RTP payload is stored in page-locked host memory.
・When rtpvrawdepay is executed, it reads data from page-locked host memory and writes the processed data back to page-locked host memory.
・Then, when nvinfer is executed, the data is copied from page-locked host memory to GPU memory and processed.
Is this correct?
Yes. The memory looks the same as other system memory, which can be read and written by the CPU.
No. The output frames are in ordinary system memory. This is why we ask you to use compressed video data (H264, H265, VP9, …) instead of raw video data (RGB, YUV, …).
nvvideoconvert will copy the frames into CUDA memory, so nvstreammux and nvinfer work on CUDA memory.
Page-locked system memory can be accessed by both the CPU and the GPU.
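As an illustration of that point, here is a small sketch using numba.cuda (an assumption chosen for demonstration only; the DeepStream elements manage this memory internally):

```python
import numpy as np
from numba import cuda

# Page-locked (pinned) host memory: the CPU reads and writes it like any
# numpy array, and the GPU's DMA engine can transfer it directly, which is
# what makes the host<->device copies in the pipeline above fast.
host_buf = cuda.pinned_array(shape=(1920 * 1080 * 2,), dtype=np.uint8)
host_buf[:] = 0x80  # CPU writes, e.g. one YCbCr-4:2:2 frame's worth of data

# GPU side: pull the pinned buffer into device memory without a staging copy.
dev_buf = cuda.to_device(host_buf)

# ... CUDA kernels would process dev_buf here ...

# The result can be DMA'd straight back into the same pinned buffer.
dev_buf.copy_to_host(host_buf)
```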