Hi, thank you for your response. Let me clarify my goal and current setup:
Objective:
I want to stream high-resolution images (e.g., 4K) generated by a PyTorch model (running on the GPU) to a client through a GStreamer RTP pipeline with as few memory copies as possible, ideally achieving zero-copy GPU processing.
CUDA tensors
The PyTorch model outputs standard RGB images as CUDA tensors, in the format:
shape: (2160, 3840, 3), dtype: torch.uint8, device: 'cuda'
So these are raw RGB frames in HWC layout, directly from the model, not preprocessed NCHW/NHWC tensors used for inference.
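As a sanity check, the layout can be verified before a frame is handed to the pipeline. A minimal sketch (the random tensor is a stand-in for real model output, and it falls back to CPU when no GPU is visible):

```python
import torch

# Stand-in for one model output frame; falls back to CPU if CUDA is unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"
frame = torch.randint(0, 256, (2160, 3840, 3), dtype=torch.uint8, device=device)

# Packed HWC RGB: GStreamer's video/x-raw,format=RGB expects exactly this
# row-major byte order, so the tensor must be contiguous.
assert frame.shape == (2160, 3840, 3)
assert frame.dtype == torch.uint8
assert frame.is_contiguous()
print(frame.numel())  # bytes per frame, since uint8 is 1 byte per element
```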
Processing in GStreamer
The GStreamer pipeline is primarily used for real-time H.264 hardware-accelerated encoding and RTP transmission. There is no additional inference or display processing in the pipeline itself. The steps are:
- Receive RGB frames (from PyTorch, on the GPU);
- Convert to NVMM I420 format (if needed);
- Encode via nvv4l2h264enc (the NVENC hardware encoder);
- Packetize with rtph264pay;
- Transmit over UDP via udpsink.
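For scale, the conversion step above also changes the per-frame footprint: packed RGB is 24 bits per pixel, while I420 (planar 4:2:0) is 12, so the numbers work out as:

```python
width, height = 3840, 2160

rgb_bytes = width * height * 3        # 8 bits per channel, 3 channels
i420_bytes = width * height * 3 // 2  # full-size Y plane + quarter-size U and V planes

print(rgb_bytes)   # 24883200  (~23.7 MiB per frame)
print(i420_bytes)  # 12441600  (~11.9 MiB per frame)
```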
Here is the pipeline I am currently using:
```python
self.pipe = (
    "rtpbin name=rtpbin rtp-profile=avpf "
    "appsrc name=source format=TIME is-live=true block=false "
    "caps=video/x-raw,format=RGB,width=3840,height=2160 "
    "! videoconvert ! nvvideoconvert ! video/x-raw(memory:NVMM),format=I420 "
    "! nvv4l2h264enc name=video_enc "
    "! video/x-h264,stream-format=byte-stream "
    "! queue "
    "! h264parse "
    "! rtph264pay mtu=1300 ssrc={} pt={} "
    "! rtprtxqueue max-size-time=1000 max-size-packets=0 "
    "! rtpbin.send_rtp_sink_0 rtpbin.send_rtp_src_0 "
    "! udpsink host={} port={} "
    "rtpbin.send_rtcp_src_0 "
    "! udpsink host={} port={} sync=false async=false"
).format(VIDEO_SSRC, VIDEO_PT, ip, port, ip, rtcpPort)
```
My concern
I notice that GPU memory consumption is relatively high (a few GB), which I believe is caused by unnecessary CPU-GPU or GPU-GPU memory copies in the pipeline. I would therefore like to eliminate the intermediate copies and feed my GPU-resident PyTorch tensors (RGB) directly into the pipeline, ideally as NVMM buffers, to avoid extra conversions.
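To put the copy cost in numbers (assuming 30 fps, which is not stated above), each host round trip of uncompressed 4K RGB moves roughly:

```python
width, height, fps = 3840, 2160, 30  # fps is an assumption, not part of the setup above

bytes_per_frame = width * height * 3   # uint8 RGB
bytes_per_second = bytes_per_frame * fps

print(bytes_per_second)          # 746496000 bytes/s
print(bytes_per_second / 2**20)  # ~711.9 MiB/s per copy direction
```

At that rate, every extra staging buffer or CPU round trip adds a substantial amount of traffic and transient allocations, which is consistent with the multi-GB usage observed.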