DeepStream zero-copy + run cuda kernel directly (Jetson Orin Nx)

SungJinWoo · July 13, 2025, 2:45pm

hello,

i am using nvidia jetson orin nx for robotics applications.
my JetPack version is 5.1.2 with full packages like CUDA-11.4 and DeepStream-6.3
also TensforRT 8.5.2.2 with Cudnn 8.6

i am trying to create custom gstreamer pipeline to implement zero-copy techniques due to reduction of latency between gpu-cpu transfer and found DeepStream could ease this way with NvBufSurface header.
in my case i have camera and run below pipeline to apply my custom gstreamer plugin on frames. i used dsexample code and wanna apply my custom cuda kernel on frames.

my gstreamer pipeline : gst-launch-1.0 nvv4l2camerasrc ! ‘video/x-raw(memory:NVMM), format=UYVY, width=1920, height=1080’ ! nvvideoconvert copy-hw=2 nvbuf-memory-type=4 ! ‘video/x-raw(memory:NVMM), format=RGBA’ ! queue ! mux.sink_0 nvstreammux name=mux batch-size=1 width=1920 height=1080 live-source=1 nvbuf-memory-type=4 ! dsexample ! nvvideoconvert copy-hw=2 nvbuf-memory-type=4 ! xvimagesink sync=false

but in dsexample plugin i dont understand how can i access to camera frames to push them into cuda kernel, my simple cuda kernel that i tested with cudaMallocPitch() and worked wrote like this:

#include <cuda_runtime.h>
#include <stdint.h>
#include <iostream>

extern "C" void launch_bgr2gray_rgba(
    uint8_t* rgba_ptr, int width, int height, int pitch, cudaStream_t stream);

__global__ void rgba_to_gray_rgba_kernel(uint8_t* rgba, int width, int height, int pitch) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uint8_t* pixel = rgba + y * pitch + x * 4;
    uint8_t r = pixel[0];
    uint8_t g = pixel[1];
    uint8_t b = pixel[2];
    uint8_t gray = static_cast<uint8_t>(0.299f * r + 0.587f * g + 0.114f * b);
    pixel[0] = gray;
    pixel[1] = gray;
    pixel[2] = gray;
    // alpha stays the same
}

extern "C" void launch_bgr2gray_rgba(
    uint8_t* rgba_ptr, int width, int height, int pitch, cudaStream_t stream) {

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    rgba_to_gray_rgba_kernel<<<grid, block, 0, stream>>>(rgba_ptr, width, height, pitch);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA kernel launch failed: %s\n", cudaGetErrorString(err));
    }
}

using cudaMallocPitch() have time cast and i wanna run this cuda kernel with zero-copy teqnique. now i need help to how can i access rgba_ptr in dsexample code for run custom cuda kernel.

static GstFlowReturn
gst_dsexample_transform_ip (GstBaseTransform * btrans, GstBuffer * inbuf)
{
  GstDsExample *dsexample = GST_DSEXAMPLE (btrans);
  GstMapInfo in_map_info;
  GstFlowReturn flow_ret = GST_FLOW_ERROR;
  gdouble scale_ratio = 1.0;
  DsExampleOutput *output;

  NvBufSurface *surface = NULL;
  NvDsBatchMeta *batch_meta = NULL;
  NvDsFrameMeta *frame_meta = NULL;
  NvDsMetaList * l_frame = NULL;
  guint i = 0;

  dsexample->frame_num++;
  CHECK_CUDA_STATUS (cudaSetDevice (dsexample->gpu_id),
      "Unable to set cuda device");

  memset (&in_map_info, 0, sizeof (in_map_info));
  if (!gst_buffer_map (inbuf, &in_map_info, GST_MAP_READ)) {
    g_print ("Error: Failed to map gst buffer\n");
    goto error;
  }

  nvds_set_input_system_timestamp (inbuf, GST_ELEMENT_NAME (dsexample));
  surface = (NvBufSurface *) in_map_info.data;
  GST_DEBUG_OBJECT (dsexample,
      "Processing Frame %" G_GUINT64_FORMAT " Surface %p\n",
      dsexample->frame_num, surface);

  if (CHECK_NVDS_MEMORY_AND_GPUID (dsexample, surface))
    goto error;

  batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);
  if (batch_meta == nullptr) {
    GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
        ("NvDsBatchMeta not found for input buffer."), (NULL));
    return GST_FLOW_ERROR;
  }

  GST_DEBUG("Memory type: %d (4=NVBUF_MEM_SURFACE_ARRAY)", surface->memType);
  if (surface->memType != NVBUF_MEM_SURFACE_ARRAY) {
    GST_WARNING("Unexpected memory type - may need different handling");
  }

  if (dsexample->process_full_frame) {
    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)
    {
      frame_meta = (NvDsFrameMeta *) (l_frame->data);
      int batch_id = frame_meta->batch_id;

      ///////////////////////////// naji ///////////////////////////////

      
      NvBufSurfaceParams *params = &surface->surfaceList[batch_id];

      if (NvBufSurfaceMap(surface, batch_id, 0, NVBUF_MAP_READ_WRITE) != 0) {
          GST_ERROR("Failed to map surface for batch %d", batch_id);
          goto error;
      }
      NvBufSurfaceSyncForCpu(surface, batch_id, 0);

      uint8_t* device_ptr = (uint8_t*)surface->surfaceList[0].dataPtr;
      int width = params->width;
      int height = params->height;
      int pitch = params->pitch;

      std::cout<<"width = "<<width<<"height = "<<height<<"pitch = "<<pitch<<std::endl;

      launch_bgr2gray_rgba(device_ptr, width, height, pitch, 0);  // stream 0 for now

      NvBufSurfaceSyncForDevice(surface, batch_id, 0);
      NvBufSurfaceUnMap(surface, batch_id, 0);



      ///////////////////////////// naji ///////////////////////////////

      NvOSD_RectParams rect_params;

      /* Scale the entire frame to processing resolution */
      rect_params.left = 0;
      rect_params.top = 0;
      rect_params.width = dsexample->video_info.width;
      rect_params.height = dsexample->video_info.height;

      /* Scale and convert the frame */
      if (get_converted_mat (dsexample, surface, i, &rect_params,
            scale_ratio, dsexample->video_info.width,
            dsexample->video_info.height) != GST_FLOW_OK) {
        goto error;
      }

      /* Process to get the output */
#ifdef WITH_OPENCV
      output =
          DsExampleProcess (dsexample->dsexamplelib_ctx,
          dsexample->cvmat->data);
#else
      output =
          DsExampleProcess (dsexample->dsexamplelib_ctx,
          (unsigned char *)dsexample->inter_buf->surfaceList[0].mappedAddr.addr[0]);
#endif
      /* Attach the metadata for the full frame */
      attach_metadata_full_frame (dsexample, frame_meta, scale_ratio, output, i);
      i++;
      free (output);
    }

  } else {
    /* Using object crops as input to the algorithm. The objects are detected by
     * the primary detector */
    NvDsMetaList * l_obj = NULL;
    NvDsObjectMeta *obj_meta = NULL;

    if(!dsexample->is_integrated) {
      if (dsexample->blur_objects) {
        if (!(surface->memType == NVBUF_MEM_CUDA_UNIFIED || surface->memType == NVBUF_MEM_CUDA_PINNED)){
          GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
              ("%s:need NVBUF_MEM_CUDA_UNIFIED or NVBUF_MEM_CUDA_PINNED memory for opencv blurring",__func__), (NULL));
          return GST_FLOW_ERROR;
        }
      }
    }

    for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
      l_frame = l_frame->next)
    {
      frame_meta = (NvDsFrameMeta *) (l_frame->data);

#ifdef WITH_OPENCV
      cv::Mat in_mat;

      if (dsexample->blur_objects) {
        /* Map the buffer so that it can be accessed by CPU */
        if (surface->surfaceList[frame_meta->batch_id].mappedAddr.addr[0] == NULL){
          if (NvBufSurfaceMap (surface, frame_meta->batch_id, 0, NVBUF_MAP_READ_WRITE) != 0){
            GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
                ("%s:buffer map to be accessed by CPU failed", __func__), (NULL));
            return GST_FLOW_ERROR;
          }
        }

        /* Cache the mapped data for CPU access */
        if(dsexample->inter_buf->memType == NVBUF_MEM_SURFACE_ARRAY)
          NvBufSurfaceSyncForCpu (surface, frame_meta->batch_id, 0);

        in_mat =
            cv::Mat (surface->surfaceList[frame_meta->batch_id].planeParams.height[0],
            surface->surfaceList[frame_meta->batch_id].planeParams.width[0], CV_8UC4,
            surface->surfaceList[frame_meta->batch_id].mappedAddr.addr[0],
            surface->surfaceList[frame_meta->batch_id].planeParams.pitch[0]);
      }
#endif

      for (l_obj = frame_meta->obj_meta_list; l_obj != NULL;
          l_obj = l_obj->next)
      {
        obj_meta = (NvDsObjectMeta *) (l_obj->data);

        if (dsexample->blur_objects) {
          /* gaussian blur the detected objects using opencv */
#ifdef WITH_OPENCV
          if (blur_objects (dsexample, frame_meta->batch_id,
            &obj_meta->rect_params, in_mat) != GST_FLOW_OK) {
          /* Error in blurring, skip processing on object. */
            GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
            ("blurring the object failed"), (NULL));
            if (NvBufSurfaceUnMap (surface, frame_meta->batch_id, 0)){
              GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
                ("%s:buffer unmap to be accessed by CPU failed", __func__), (NULL));
            }
            return GST_FLOW_ERROR;
          }
          continue;
#else
          GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
          ("OpenCV has been deprecated, hence object blurring will not work."
          "Enable OpenCV compilation in gst-dsexample Makefile by setting 'WITH_OPENCV:=1"), (NULL));
          return GST_FLOW_ERROR;
#endif
        }

        /* Should not process on objects smaller than MIN_INPUT_OBJECT_WIDTH x MIN_INPUT_OBJECT_HEIGHT
         * since it will cause hardware scaling issues. */
        if (obj_meta->rect_params.width < MIN_INPUT_OBJECT_WIDTH ||
            obj_meta->rect_params.height < MIN_INPUT_OBJECT_HEIGHT)
          continue;

        /* Crop and scale the object */
        if (get_converted_mat (dsexample,
              surface, frame_meta->batch_id, &obj_meta->rect_params,
              scale_ratio, dsexample->video_info.width,
              dsexample->video_info.height) != GST_FLOW_OK) {
          /* Error in conversion, skip processing on object. */
          continue;
        }

#ifdef WITH_OPENCV
        /* Process the object crop to obtain label */
        output = DsExampleProcess (dsexample->dsexamplelib_ctx,
            dsexample->cvmat->data);
#else
        /* Process the object crop to obtain label */
        output = DsExampleProcess (dsexample->dsexamplelib_ctx,
            (unsigned char *)dsexample->inter_buf->surfaceList[0].mappedAddr.addr[0]);
#endif

        /* Attach labels for the object */
        attach_metadata_object (dsexample, obj_meta, output);

        free (output);
      }

      if (dsexample->blur_objects) {
      /* Cache the mapped data for device access */
        if(dsexample->inter_buf->memType == NVBUF_MEM_SURFACE_ARRAY) 
          NvBufSurfaceSyncForDevice (surface, frame_meta->batch_id, 0);

#ifdef WITH_OPENCV
#ifdef DSEXAMPLE_DEBUG
        /* Use openCV to remove padding and convert RGBA to BGR. Can be skipped if
        * algorithm can handle padded RGBA data. */
#if (CV_MAJOR_VERSION >= 4)
        cv::cvtColor (in_mat, *dsexample->cvmat, cv::COLOR_RGBA2BGR);
#else
        cv::cvtColor (in_mat, *dsexample->cvmat, CV_RGBA2BGR);
#endif
        /* used to dump the converted mat to files for debug */
        static guint cnt = 0;
        cv::imwrite("out_" + std::to_string (cnt) + ".jpeg", *dsexample->cvmat);
        cnt++;
#endif
#endif
      }
    }
  }
  flow_ret = GST_FLOW_OK;

error:

  nvds_set_output_system_timestamp (inbuf, GST_ELEMENT_NAME (dsexample));
  gst_buffer_unmap (inbuf, &in_map_info);
  return flow_ret;
}

Thanks …

Fiona.Chen · July 14, 2025, 3:09am

Why do you need to use nvstreammux in your pipeline?

The gst-dsexample plugin is a “in-place” transform plugin which will not pass the processed buffer to downstream. Since you “wanna apply my custom cuda kernel on frames”, will you need the processed frames to be passed to downstream?

The video frame data is stored in NvBufSurface in DeepStream. NVIDIA DeepStream SDK API Reference: NvBufSurface Types and Functions | NVIDIA Docs. You need to follow the instructions in get_converted_mat() function in /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-dsexample/gstdsexample.cpp to get the CUDA buffer from NvBufSurface.

In get_converted_mat() function, there is a piece of code

  if(dsexample->is_integrated) {
#ifdef __aarch64__
    /* To use the converted buffer in CUDA, create an EGLImage and then use
    * CUDA-EGL interop APIs */
    if (USE_EGLIMAGE) {
      if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
        goto error;
      }

      /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
      * Use interop APIs cuGraphicsEGLRegisterImage and
      * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */

      /* Destroy the EGLImage */
      NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
    }
#endif
  }

The cuGraphicsEGLRegisterImage and cuGraphicsResourceGetMappedEglFrame are all CUDA interfaces. You can refer to the CUDA document for the usage. CUDA Driver API :: CUDA Toolkit Documentation

SungJinWoo · July 14, 2025, 5:33pm

hello sir,
i confused, i know that jetson boards support NVBUF_MEM_SURFACE_ARRAY memory type so i cant use cuda kernel directly on NvBufSurface data and some mapping required.
also i used nvstreammux in my pipeline for handling of gst_buffer_get_nvds_batch_meta function used in gstdsexample.cpp code: (is it necessary?)

gst_dsexample_transform_ip (GstBaseTransform * btrans, GstBuffer * inbuf)
{
  GstDsExample *dsexample = GST_DSEXAMPLE (btrans);
  GstMapInfo in_map_info;
  GstFlowReturn flow_ret = GST_FLOW_ERROR;
  NvBufSurface *surface = NULL;
  NvDsBatchMeta *batch_meta = NULL;
  NvDsFrameMeta *frame_meta = NULL;
  NvDsMetaList *l_frame = NULL;
  guint i = 0;

  dsexample->frame_num++;
  CHECK_CUDA_STATUS (cudaSetDevice (dsexample->gpu_id),
      "Unable to set cuda device");

  memset (&in_map_info, 0, sizeof (in_map_info));
  if (!gst_buffer_map (inbuf, &in_map_info, GST_MAP_READWRITE)) {
    g_print ("Error: Failed to map gst buffer\n");
    goto error;
  }

  nvds_set_input_system_timestamp (inbuf, GST_ELEMENT_NAME (dsexample));
  surface = (NvBufSurface *) in_map_info.data;

  GST_DEBUG_OBJECT (dsexample,
      "Processing Frame %" G_GUINT64_FORMAT " Surface %p\n",
      dsexample->frame_num, surface);

  if (CHECK_NVDS_MEMORY_AND_GPUID (dsexample, surface))
    goto error;

  batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);
  if (batch_meta == nullptr) {
    GST_ELEMENT_ERROR (dsexample, STREAM, FAILED,
        ("NvDsBatchMeta not found for input buffer."), (NULL));
    goto error;
  }

  if (surface->memType != NVBUF_MEM_SURFACE_ARRAY) {
    GST_WARNING("Unexpected memory type: %d", surface->memType);
    goto error;
  }

  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
       l_frame = l_frame->next)
  {
    frame_meta = (NvDsFrameMeta *) (l_frame->data);
    int batch_id = frame_meta->batch_id;

i tried to solve mapping problem by adding this part of code to my gst_dsexample_transform_ip function:

for (guint batch_id = 0; batch_id < surface->batchSize; batch_id++) {
    NvBufSurfaceParams *params = &surface->surfaceList[batch_id];

    if (batch_id >= surface->batchSize) {
        GST_ERROR("Invalid batch_id %d (batchSize=%d)", batch_id, surface->batchSize);
        continue;
    }

    if (params->dataPtr == nullptr) {
        g_print("dataPtr is null for batch_id %d\n", batch_id);
        continue;
    }

    if (params->width == 0 || params->height == 0 || params->pitch == 0) {
        g_print("Invalid frame params (width/height/pitch) for batch_id %d\n", batch_id);
        continue;
    }

    g_print("Batch %d: width=%d height=%d pitch=%d dataPtr=%p\n",
        batch_id, params->width, params->height, params->pitch, params->dataPtr);

    int plane = 0;
    if (NvBufSurfaceMap(surface, batch_id, plane, NVBUF_MAP_READ_WRITE) != 0) {
        GST_ERROR("Failed to map surface for batch_id %d", batch_id);
        continue;
    }

    if (NvBufSurfaceSyncForDevice(surface, batch_id, dsexample->gpu_id) != 0) {
        GST_ERROR("Failed to sync surface for device access");
        NvBufSurfaceUnMap(surface, batch_id, plane);
        continue;
    }

    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, params->dataPtr) != cudaSuccess) {
        GST_ERROR("cudaPointerGetAttributes failed for batch %d", batch_id);
        NvBufSurfaceUnMap(surface, batch_id, plane);
        continue;
    }

    if (attr.type == cudaMemoryTypeDevice) {
        rgb2gray_inplace(
          (uchar4*)dsexample->inter_buf->surfaceList[0].dataPtr,
          surface->surfaceList[batch_id].width,
          surface->surfaceList[batch_id].height,
          surface->surfaceList[batch_id].pitch,
          dsexample->cuda_stream);
    } else {
        GST_WARNING("dataPtr is not CUDA device memory (attr.type=%d). Copying to inter_buf.", attr.type);

        if (cudaMemcpyAsync(dsexample->inter_buf->surfaceList[0].dataPtr,
                        params->dataPtr,
                        params->pitch * params->height,
                        cudaMemcpyHostToDevice,
                        dsexample->cuda_stream) != cudaSuccess) {
            GST_ERROR("cudaMemcpyAsync to inter_buf failed for batch_id %d", batch_id);
            NvBufSurfaceUnMap(surface, batch_id, plane);
            continue;
        }

        cudaStreamSynchronize(dsexample->cuda_stream);

        rgb2gray_inplace(
          (uchar4*)dsexample->inter_buf->surfaceList[0].dataPtr,
          surface->surfaceList[batch_id].width,
          surface->surfaceList[batch_id].height,
          surface->surfaceList[batch_id].pitch,
          dsexample->cuda_stream);
    }

    NvBufSurfaceUnMap(surface, batch_id, plane);
}

it run without any runtime bug but my frames didn’t convert to gray (from RGBA) and i got this cuda error in my cuda kernel:

CUDA error: an illegal memory access was encountered

so i need help, how can i mapping NvBufSurface correctly to desire data pointer and push it into my cuda kernel. i am really excited about part of code you told me about another mapping:

  if(dsexample->is_integrated) {
#ifdef __aarch64__
    /* To use the converted buffer in CUDA, create an EGLImage and then use
    * CUDA-EGL interop APIs */
    if (USE_EGLIMAGE) {
      if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
        goto error;
      }

      /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
      * Use interop APIs cuGraphicsEGLRegisterImage and
      * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */

      /* Destroy the EGLImage */
      NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
    }
#endif
  }

Thanks so much …

Fiona.Chen · July 15, 2025, 2:40am

If you don’t need to do inferencing, you don’t need to convert the data to batched data by nvstreammux. If you need the result of “wanna apply my custom cuda kernel on frames” to be available in the downstream elements in the pipeline, gst-dsexample is not suitable too. If your cuda algorithm generate another format OR another resolution OR another FPS OR any other caps of data from the frame data, you need a “transform” transformation plugin. Please explain what kind of CUDA algorithm you will apply, what is the output? Do you want the output to be available in downstream elements in the pipeline?

I’ve told you the only part related to mapping and getting the CUDA buffer from the NvBufSurface is the piece of code

  if(dsexample->is_integrated) {
#ifdef __aarch64__
    /* To use the converted buffer in CUDA, create an EGLImage and then use
    * CUDA-EGL interop APIs */
    if (USE_EGLIMAGE) {
      if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
        goto error;
      }

      /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
      * Use interop APIs cuGraphicsEGLRegisterImage and
      * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */

      /* Destroy the EGLImage */
      NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
    }
#endif
  }

The dsexample->inter_buf can be any NvBufSurface pointer, the NvBufSurfaceMapEglImage+cuGraphicsEGLRegisterImage+cuGraphicsResourceGetMappedEglFrame+ NvBufSurfaceUnMapEglImage is what you need. The explanation and usage of the APIs can be found in NVIDIA DeepStream SDK API Reference: NvBufSurface Types and Functions | NVIDIA Docs and CUDA Driver API :: CUDA Toolkit Documentation

SungJinWoo · July 17, 2025, 11:21am

hello,
Thanks for answering,
all my CUDA kernels are similar to following code:

#include <cuda_runtime.h>
#include <stdint.h>
#include <iostream>

extern "C" void launch_bgr2gray_rgba(
    uint8_t* rgba_ptr, int width, int height, int pitch, cudaStream_t stream);

__global__ void rgba_to_gray_rgba_kernel(uint8_t* rgba, int width, int height, int pitch) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uint8_t* pixel = rgba + y * pitch + x * 4;
    uint8_t r = pixel[0];
    uint8_t g = pixel[1];
    uint8_t b = pixel[2];
    uint8_t gray = static_cast<uint8_t>(0.299f * r + 0.587f * g + 0.114f * b);
    pixel[0] = gray;
    pixel[1] = gray;
    pixel[2] = gray;
    // alpha stays the same
}

extern "C" void launch_bgr2gray_rgba(
    uint8_t* rgba_ptr, int width, int height, int pitch, cudaStream_t stream) {

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    rgba_to_gray_rgba_kernel<<<grid, block, 0, stream>>>(rgba_ptr, width, height, pitch);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA kernel launch failed: %s\n", cudaGetErrorString(err));
    }
}

I will push frames to my c++ application with appsink plugin but before downstream i wanna use zero-copy techniques to apply CUDA kernels like simple RGBA2GRAY and at final step sink to c++ application.
i wanna best solution to do this but with minimum delay and latency and compatible with my sample CUDA kernels.

Thanks sir …

Fiona.Chen · July 18, 2025, 9:50am

Seems it is OK.

SungJinWoo · July 20, 2025, 7:15pm

hello,
ok, so if i use this part of code and map data based on EGLIMAGE techniques, i could handle cuda memory to apply my custom cuda kernel?
in next step and after UnMap i could downstream frames to use in python or c++ application by using appsink plugin?
Thanks …

Fiona.Chen:

  if(dsexample->is_integrated) {
#ifdef __aarch64__
    /* To use the converted buffer in CUDA, create an EGLImage and then use
    * CUDA-EGL interop APIs */
    if (USE_EGLIMAGE) {
      if (NvBufSurfaceMapEglImage (dsexample->inter_buf, 0) !=0 ) {
        goto error;
      }

      /* dsexample->inter_buf->surfaceList[0].mappedAddr.eglImage
      * Use interop APIs cuGraphicsEGLRegisterImage and
      * cuGraphicsResourceGetMappedEglFrame to access the buffer in CUDA */

      /* Destroy the EGLImage */
      NvBufSurfaceUnMapEglImage (dsexample->inter_buf, 0);
    }
#endif
  }

Fiona.Chen · July 21, 2025, 1:21am

Yes, you can get the CUDA memory from NvBufSurface.

After “unmap” the CUIDA memory is not available any more. To get the memory in NvBufSurface is a synchronized operation.

yingliu · August 22, 2025, 5:22am

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks.

Topic		Replies	Views
How to ensure GPU zero-copy buffer access with GStreamer appsink on Jetson (using NvBufSurface and CUDA)? DeepStream SDK deepstream	7	461	August 12, 2025
Error generated while running the code after connecting the camera Jetson Xavier NX gstreamer , nvbugs	45	1992	January 2, 2024
Using nvbufsurface to run custom cuda kernel DeepStream SDK deepstream	7	151	December 9, 2025
Gstreamer writing to CUDA memory and zero copy cv::cuda::GpuMat with Jetpack 5.1.2 Jetson AGX Xavier camera , cuda , gstreamer	11	636	August 13, 2025
Deepstream on Jetson - getting cv::gpuMat from NvBufSurface DeepStream SDK	5	334	August 9, 2024
From nvBufSurface to gpu::Mat or cuda stream DeepStream SDK cuda , deepstream	15	423	February 6, 2025
How to avoid double copy of image from EGL stream, when creating batched nvbuf surface DeepStream SDK camera , jetson-inference , nvbugs	15	843	March 20, 2024
How to create opencv gpumat from nvstream? DeepStream SDK	36	15569	July 27, 2021
Deepstream DeepStream SDK gstreamer , jetson , deepstream	14	653	July 9, 2024
How to NvBufSurface to cv::Mat? DeepStream SDK	12	789	July 2, 2024

DeepStream zero-copy + run cuda kernel directly (Jetson Orin Nx)

Related topics