Share CUDA memory with GStreamer

• Hardware Platform (Jetson / GPU)
GPU

• DeepStream Version
DeepStream 7.0

• JetPack Version (valid for Jetson only)

• TensorRT Version

• NVIDIA GPU Driver Version (valid for GPU only)
535.129.03

• Issue Type (questions, new requirements, bugs)
Questions

• How to reproduce the issue? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
N/A

• Requirement details (This is for new requirement. Including the module name—for which plugin or for which sample application, the function description)
Hi, I want to achieve zero-copy transmission of inference results or images from a PyTorch model to a GStreamer pipeline for further processing and display in Python, in order to avoid unnecessary data copying and improve overall performance and efficiency. Could you please advise how to implement this? Are there any sample implementations or recommended solutions?

Hi,

It depends on the type of processing you intend to perform in GStreamer. Most standard GStreamer elements operate on CPU memory, while NVIDIA GStreamer elements (from the DeepStream SDK) work with NVMM buffers for GPU-accelerated processing. You would still need a memory copy (GPU to GPU) to use DeepStream, since nvstreammux moves regular NVMM memory into the "batched" NVMM memory that all DeepStream elements use.

If you want to use NVIDIA elements without additional memory copies, you'll need to work with the NvBufSurface API to allocate and manage NVMM buffers. However, I'm not sure if it's currently possible to create buffers directly from Python; my experience has been with the C API.

Once you have an NVMM buffer, you can push it into a GStreamer pipeline using appsrc, as long as the caps are set correctly (video/x-raw(memory:NVMM)).
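For illustration, a minimal launch-string sketch showing the NVMM caps on appsrc might look like the following (element names assume a DeepStream install; in a real application appsrc must be fed NVMM buffers programmatically, so this fragment only illustrates the caps negotiation, not the data path):

```
appsrc name=src is-live=true format=TIME \
  caps="video/x-raw(memory:NVMM),format=RGBA,width=3840,height=2160,framerate=30/1" \
  ! nvvideoconvert ! "video/x-raw(memory:NVMM),format=I420" \
  ! nvv4l2h264enc ! fakesink
```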

Hi, thank you so much for your detailed and insightful reply. It really helped me clarify the memory flow between CPU, GPU, and NVMM in the context of GStreamer and DeepStream. I truly appreciate the time you took to break it down so clearly.

As a follow-up, I’m currently trying to push video frames originating from PyTorch (CUDA tensors) directly into a GStreamer pipeline with minimal or zero memory copy. From what you explained, it sounds like working with the NvBufSurface API is the right direction — which makes a lot of sense.

That said, I was wondering:

Do you happen to have any code examples or references demonstrating how to allocate NvBufSurface memory and map PyTorch CUDA memory (or raw CUDA device pointers) into NVMM buffers?

Alternatively, are there any SDK samples or documentation links you’d recommend that cover this type of integration? I’ve mostly seen C API usage, but I’d like to bridge this with PyTorch workflows (potentially via pybind11 or similar).

Thanks again for your help — I really appreciate it!

Hi,

There’s no direct example for mapping PyTorch tensors into NvBufSurface, but DeepStream does support creating surfaces from dmabuf via:

int NvBufSurfaceCreate(NvBufSurface **surf, uint32_t batchSize,
                       NvBufSurfaceCreateParams *params);

You’ll find the relevant definitions in:

/opt/nvidia/deepstream/deepstream-7.0/sources/includes/nvbufsurface.h
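For reference, a pseudocode-style sketch of filling those creation parameters for a plain CUDA surface might look like this (field and enum names are from my reading of nvbufsurface.h; verify them against the header before use):

```
NvBufSurfaceCreateParams params = {0};
params.gpuId       = 0;
params.width       = 3840;
params.height      = 2160;
params.colorFormat = NVBUF_COLOR_FORMAT_RGBA;
params.layout      = NVBUF_LAYOUT_PITCH;
params.memType     = NVBUF_MEM_CUDA_DEVICE;   /* plain CUDA device memory */

NvBufSurface *surf = NULL;
if (NvBufSurfaceCreate(&surf, 1, &params) != 0) {
    /* handle allocation failure */
}
```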

If you have a valid dmabuf FD, just fill in the NvBufSurfaceMapParams accordingly. DeepStream also provides Python bindings via pyds.

What kind of processing do you want with the GStreamer pipeline?

What do you mean by "display in Python"?

What kind of CUDA tensors? The original RGB data from the images or the NCHW/NHWC/NWC data which is converted from the original RGB data?

Hi, thank you for your response. Let me clarify my goal and current setup:

Objective:

I want to stream high-resolution images (e.g., 4K) generated by a PyTorch model (running on the GPU) to a client over a GStreamer RTP pipeline with as few memory copies as possible, ideally achieving zero-copy GPU processing.

CUDA tensors

The PyTorch model outputs standard RGB images as CUDA tensors, in the format:

shape: (2160, 3840, 3), dtype: torch.uint8, device: 'cuda'

So these are raw RGB frames in HWC layout, directly from the model, not preprocessed NCHW/NHWC tensors used for inference.
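For context, the raw bandwidth implied by that tensor shape can be computed directly (the 30 fps figure below is an assumption, not something stated above):

```python
# Size of one raw 4K RGB frame in HWC layout with uint8 samples.
width, height, channels = 3840, 2160, 3
bytes_per_frame = width * height * channels      # 1 byte per sample
print(bytes_per_frame)                           # 24883200 bytes, about 23.7 MiB

fps = 30                                         # assumed frame rate
print(bytes_per_frame * fps / 1e9)               # about 0.75 GB/s of raw traffic
```

So each avoided copy saves roughly three quarters of a gigabyte per second of memory traffic at 30 fps.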

Processing in GStreamer

The GStreamer pipeline is primarily used for real-time H.264 hardware-accelerated encoding and RTP transmission. There is no additional inference or display processing in the pipeline itself. The steps are:

  1. Receive RGB frames (from PyTorch, on GPU);
  2. Convert to NVMM I420 format (if needed);
  3. Encode via nvv4l2h264enc (using NVENC hardware encoder);
  4. Packetize with rtph264pay;
  5. Transmit over UDP via udpsink.

Here is the pipeline I am currently using:

self.pipe = "rtpbin name=rtpbin rtp-profile=avpf " \
           "appsrc name=source format=TIME is-live=true block=false " \
           "caps=video/x-raw,format=RGB,width=3840,height=2160 " \
           "! videoconvert ! nvvideoconvert ! video/x-raw(memory:NVMM),format=I420 " \
           "! nvv4l2h264enc name=video_enc " \
           "! video/x-h264,stream-format=byte-stream " \
           "! queue " \
           "! h264parse " \
           "! rtph264pay mtu=1300 ssrc={} pt={} " \
           "! rtprtxqueue max-size-time=1000 max-size-packets=0 " \
           "! rtpbin.send_rtp_sink_0 rtpbin.send_rtp_src_0 " \
           "! udpsink host={} port={} " \
           "rtpbin.send_rtcp_src_0 " \
           "! udpsink host={} port={} " \
           "sync=false async=false".format(VIDEO_SSRC, VIDEO_PT, ip, port, ip, rtcpPort)

My concern

I notice that GPU memory consumption is relatively high (a few GB), and as far as I can tell it is caused by unnecessary CPU-GPU or GPU-GPU memory copies along the pipeline. So I would like to eliminate the intermediate copies and feed my GPU-resident PyTorch tensors (RGB) directly into the pipeline, ideally using NVMM buffers to avoid extra conversions.

Thanks for your detailed and patient reply! I’ll look into how to obtain a DMA buffer from the original RGB images and then use it to create NvBufSurface instances accordingly.

Actually, the DeepStream SDK is mainly for inferencing. Since you have decided to use PyTorch for the inferencing, the output of your super resolution model is stored in a CUDA buffer. The suggestion in Share cuda memory with gstreamer - #5 by miguel.taylor may help you, but I don't think you can get a dmabuf FD from PyTorch APIs. You need to do a GPU-to-GPU memory copy from the PyTorch CUDA memory to the DeepStream NvBufSurface CUDA memory.

Thank you for your helpful suggestions — I now have a much clearer understanding of the current situation.

I’d like to ask a follow-up question: as you mentioned, if we want to perform a GPU-to-GPU memory copy from the PyTorch CUDA memory to the DeepStream NvBufSurface CUDA memory, what would be the recommended way to do this? Is there any example or test?

Also, could you please confirm how many memory copies are involved in this process? As far as I understand, there will be one GPU-to-GPU copy — specifically:
PyTorch CUDA tensor → CuPy (to access the pointer) → DeepStream NvBufSurface.
Is that correct?

Thanks again for your continued patience and guidance!

We used to do the CUDA memory copy with the CUDA API cudaMemcpyAsync() in C/C++. You may consult the CUDA forum for how to do the copy in Python: Latest CUDA/CUDA Programming and Performance topics - NVIDIA Developer Forums

From a DeepStream point of view, we'd suggest you use DeepStream for the whole inferencing pipeline from the beginning.

One copy per frame. The larger the resolution, the slower the copy runs.
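A rough sanity check of that per-frame cost, using an assumed on-device copy bandwidth of 300 GB/s (a placeholder figure, not a measurement; benchmark your own GPU):

```python
# Estimated device-to-device copy time for one 4K RGB frame.
bytes_per_frame = 3840 * 2160 * 3   # 24883200 bytes
bandwidth = 300e9                   # bytes/s, assumed placeholder value
copy_time_us = bytes_per_frame / bandwidth * 1e6
print(round(copy_time_us, 1))       # roughly 83 microseconds per frame
```

Under that assumption the copy itself is small next to a 33 ms frame budget at 30 fps; the bigger cost is usually the extra allocation and synchronization around it.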

We developed a custom element for a client that replaced the sgie nvinfer element in a DeepStream pipeline and used PyTorch for inference. It essentially replicated the functionality of a secondary inference model in DeepStream, but with PyTorch, and added the resulting metadata to the DeepStream meta.

Something similar might work for you: you could build a complete DeepStream pipeline and develop a custom Python GStreamer element that maps the GStreamer buffer (batched NVMM) to a NumPy array, processes it with PyTorch, and attaches the output as DeepStream metadata. That way, you can even use nvdsosd for on-screen display if needed.
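A skeleton of that pad-probe approach might look like the following. This is a non-runnable sketch: it assumes the GStreamer gi bindings and the pyds package are installed, and it omits pipeline construction and the metadata-attachment code entirely.

```
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst
import pyds

def probe_cb(pad, info, user_data):
    gst_buffer = info.get_buffer()
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        # Maps the batched NVMM surface as a NumPy array.
        frame = pyds.get_nvds_buf_surface(hash(gst_buffer), frame_meta.batch_id)
        # ... run PyTorch on `frame`, attach results as DeepStream meta ...
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK
```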

Thank you very much for your helpful suggestion!

In my current project, I would like to first try converting the PyTorch tensor to a CuPy array, and then use the relevant functions from pyds to create an NvBufSurface, performing only a single GPU-to-GPU memory copy in the process. This approach may help reduce redundant memory transfers while keeping the pipeline GPU-efficient.
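One way to check whether the CuPy step itself adds a copy is to compare torch_tensor.data_ptr() with cupy_array.data.ptr after cupy.asarray (which goes through DLPack and should share memory, though that is worth verifying on your setup). The view-versus-copy distinction can be illustrated in pure Python with memoryview as a CPU stand-in:

```python
frame = bytearray(16)        # stand-in for a GPU frame buffer
view = memoryview(frame)     # zero-copy: shares the same memory
snapshot = bytes(frame)      # explicit copy: independent memory

frame[0] = 255               # mutate the underlying buffer
print(view[0])               # 255: the view sees the change
print(snapshot[0])           # 0: the copy does not
```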

I’m looking forward to hearing your thoughts on whether this approach is feasible within the DeepStream framework.

Thanks again for your support and guidance!

You can create an NvBufSurface and copy your CUDA data on the GPU in appsrc, but the appsrc sample is only available in C: /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-appsrc-cuda-test

Thanks a lot for the helpful suggestion! I will try it soon!

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.