GStreamer writing to CUDA memory and zero-copy cv::cuda::GpuMat with JetPack 5.1.2

Hi, I have a setup where I am streaming multiple (4) GMSL2 cameras (IMX390) into my app, doing a little OpenCV processing, then using the dusty-nv libraries for a WebRTC stream to a locally hosted webpage. My work can all be done in GPU memory, and I quickly noticed that in my original implementation a lot of time was spent reading in frames from the OpenCV video source (this copies into CPU memory, which later gets copied again into GPU memory).
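For context, here is roughly what that original path looked like (an illustrative sketch, not my exact code; the pipeline string is a typical OpenCV/GStreamer capture). cap.read() lands each frame in CPU memory, and upload() then copies it to the GPU a second time:

#include <opencv2/opencv.hpp>
#include <opencv2/core/cuda.hpp>

void old_capture_path()
{
    // appsink hands frames to OpenCV in system memory
    cv::VideoCapture cap("nvarguscamerasrc sensor-id=0 ! "
                         "video/x-raw(memory:NVMM),width=1920,height=1080 ! "
                         "nvvidconv ! video/x-raw,format=BGRx ! "
                         "videoconvert ! video/x-raw,format=BGR ! appsink",
                         cv::CAP_GSTREAMER);
    cv::Mat cpu_frame;
    cv::cuda::GpuMat gpu_frame;
    while (cap.read(cpu_frame))       // copy #1: NVMM -> CPU
        gpu_frame.upload(cpu_frame);  // copy #2: CPU -> GPU
}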

I went down the path of trying to use GStreamer directly in my C++ code. I set up my GStreamer pipeline/appsink and tried to convert into an EGLImage, but many of the examples I followed made use of nvbuf_utils, which is deprecated in JetPack 5.1.2, and I never got this working.

From my reading/understanding, I needed to use the newer approach with NvBufSurface and surface transforms. Going this route, while setting up the NvBufSurface I found out that my version of JetPack doesn't support NVBUF_MEM_CUDA_UNIFIED as the memory type (a JetPack 6 feature, I believe?). It only really appears to support NVBUF_MEM_DEFAULT, which, if I'm not mistaken, will cause it to hit the CPU again.
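For reference, this is roughly what I was attempting when allocating the surface (a sketch against the nvbufsurface.h API as I understand it; exact enum support may differ by JetPack release):

#include <nvbufsurface.h>
#include <cstdio>

void try_create_surface()
{
    NvBufSurface* surf = nullptr;
    NvBufSurfaceCreateParams create_params = {};
    create_params.gpuId       = 0;
    create_params.width       = 512;
    create_params.height      = 320;
    create_params.colorFormat = NVBUF_COLOR_FORMAT_RGBA;
    create_params.layout      = NVBUF_LAYOUT_PITCH;
    create_params.memType     = NVBUF_MEM_CUDA_UNIFIED;  // fails for me on JetPack 5.1.2
    // create_params.memType  = NVBUF_MEM_DEFAULT;       // works, but is SurfaceArray memory on Jetson

    if (NvBufSurfaceCreate(&surf, 1, &create_params) != 0)
        fprintf(stderr, "NvBufSurfaceCreate failed\n");
}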

Sorry if this post is confusing, but my head is spinning a little trying to figure out the correct approach. All I really want is to read camera frames into GPU memory such that I can have a cv::cuda::GpuMat point to it without hitting the CPU. Reading other forum posts it appears possible; I just haven't pieced it all together yet and was hoping for some help, or an example.
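To be concrete, the end state I'm after is just wrapping the device pointer from a mapped frame in a GpuMat header with no copy, something like this (dev_ptr and pitch_bytes are placeholders for whatever the mapped frame provides):

#include <opencv2/core/cuda.hpp>

cv::cuda::GpuMat wrap_device_frame(void* dev_ptr, int width, int height, size_t pitch_bytes)
{
    // CV_8UC4 for the RGBA frames the pipeline produces. The GpuMat does not
    // own this memory; the underlying mapping must stay valid while it is used.
    return cv::cuda::GpuMat(height, width, CV_8UC4, dev_ptr, pitch_bytes);
}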

Hi,

Please check the two samples below:

It is expected to work on JetPack 6 as both use the NvBufSurface API.
Thanks.

I’m looking at Honey_Patouceul’s example in your second link. There I noticed that nvbuf_utils.h is still being included (this header doesn’t appear to exist in JetPack 5.1.2). Commenting out that include, however, I get the project to compile, but running any of the sample commands I always get: ERROR **: : Failed to go into PLAYING state.

I tested some GStreamer commands with gst-launch-1.0 and saw everything connect and run from the command line, but the same setup is failing when run from the code.
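For example, a command along these lines runs fine (an illustrative reconstruction matching the pipeline in my code, with fakesink standing in for appsink, since appsink needs an application attached):

gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvvidconv ! 'video/x-raw(memory:NVMM),width=512,height=320,format=RGBA' ! \
  fakesink sync=false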

I’ll continue testing, perhaps try to merge this example with one of my earlier iterations where I thought I was close, and get back to you.

Okay, so following the example in your first link I was able to get everything up and running (I had to use appsink, as nv3dsink wasn’t working for me, even though gst-inspect-1.0 nv3dsink showed the sink was available).

Unfortunately, to my surprise, running this code was no faster than my original implementation, showing similar CPU utilization with all cores running at about 50%. This surprised me, as my old implementation used GStreamer through OpenCV’s VideoCapture.read(). I thought this method would avoid the CPU bottleneck that occurs when reading frames with OpenCV by keeping everything on the GPU.

Reading other posts, I found several discussions about cuGraphicsEGLRegisterImage, cuGraphicsResourceGetMappedEglFrame, and cuGraphicsUnregisterResource running every frame and consuming a good chunk of resources. For my application I timed cuGraphicsEGLRegisterImage() and see it alone runs for ~2.6+ ms (total from all 4 cameras). The discussion didn’t have a satisfying conclusion, and I was wondering if there is any more information that can be provided on working around or avoiding registering on every frame. Can I just register the memory block once somehow?

Here is the other discussion:

Below is my code (it reads in 4 camera frames and sends them out over WebRTC). Sorry, my code is very messy, as I am using it as a sandbox at this point: running quick tests, adding pieces as each gets working, so structure and cleanliness are severely lacking.

#include <gst/gst.h>
//#include <nvbuf_utils.h>
#include <nvbufsurface.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cudaEGL.h>
#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>
#include <cuda_runtime_api.h>

#include <jetson-utils/videoOutput.h>
#include <jetson-utils/cudaMappedMemory.h>
#include <jetson-utils/logging.h>

#include <vector>
#include <iostream>

#include <chrono>

videoOutput* output;

GMainLoop *loop;
GstElement *pipeline;
gint id0=0;
gint id1=1;
gint id2=2;
gint id3=3;

uchar *unified_final_buffer;
cv::cuda::GpuMat gpu_final_frame;

auto loop_start = std::chrono::high_resolution_clock::now();

void CvCudaProcessEGLImage(EGLImageKHR& egl_image, unsigned int pitch, int sensorID)
{
    //printf("EglImage at %p\n", egl_image);
    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource pResource = NULL;
    cudaFree(0);  // no-op that forces CUDA primary context creation on this thread

//    loop_start = std::chrono::high_resolution_clock::now();
    status = cuGraphicsEGLRegisterImage(&pResource, egl_image, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
//    auto end_time = std::chrono::high_resolution_clock::now();
//    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end_time - loop_start).count();
//    std::cout << "Elapsed: " << elapsed << std::endl;

    if (status != CUDA_SUCCESS)
        g_printerr("cuGraphicsEGLRegisterImage failed\n");

    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);
    if (status != CUDA_SUCCESS)
        g_printerr("cuGraphicsResourceGetMappedEglFrame failed\n");

    status = cuCtxSynchronize();
    if (status != CUDA_SUCCESS)
        g_printerr("cuCtxSynchronize failed\n");

    // pitch is in bytes; at 4 bytes per RGBA pixel, pitch/4 gives the row
    // width in pixels (512 here, assuming no extra row padding)
    cv::cuda::GpuMat d_mat(eglFrame.height, pitch/4, CV_8UC4, (uchar*) eglFrame.frame.pPitch[0]);
    if(sensorID==0){
        d_mat.copyTo(gpu_final_frame(cv::Rect(0, 0, 512, 320)));
    }
    else if(sensorID==1){
        d_mat.copyTo(gpu_final_frame(cv::Rect(512, 0, 512, 320)));
    }
    else if(sensorID==2){
        d_mat.copyTo(gpu_final_frame(cv::Rect(0, 320, 512, 320)));
    }
    else{
       d_mat.copyTo(gpu_final_frame(cv::Rect(512, 320, 512, 320)));
    }

    status = cuCtxSynchronize();
    if (status != CUDA_SUCCESS)
        g_printerr("cuCtxSynchronize failed\n");

    if (output && sensorID==3) {
        output->Render((uchar4*)unified_final_buffer, 1024, 640);
    }

    status = cuGraphicsUnregisterResource(pResource);
    if (status != CUDA_SUCCESS)
        g_printerr("cuGraphicsUnregisterResource failed\n");

}

static GstPadProbeReturn conv_src_pad_buffer_probe (GstPad * pad, GstPadProbeInfo * info, gpointer u_data)
{
    int sensorID = *(int*)u_data;
    GstBuffer *buffer = (GstBuffer *) info->data;
    GstMapInfo map    = {0};
    gst_buffer_map (buffer, &map, GST_MAP_WRITE);

    EGLImageKHR egl_image;

    NvBufSurface* surf = (NvBufSurface*)map.data;
    NvBufSurfaceParams params = surf->surfaceList[0];

    //g_print("Surface has width=%u, height=%u, pitch=%u, memsize=%u at %p bufferDesc=%zu\n", params.width, params.height, params.pitch, params.dataSize, params.dataPtr, params.bufferDesc);

    // index -1 maps every surface in the batch (the batch size is 1 here)
    if(0 != NvBufSurfaceMapEglImage(surf, -1))
        g_printerr("Failed to map surface to EGLImage\n");

    egl_image = surf->surfaceList[0].mappedAddr.eglImage;
    if (egl_image == NULL)
        g_printerr("No EGLImage mapped from surface\n");

    CvCudaProcessEGLImage(egl_image, params.pitch, sensorID);
    NvBufSurfaceUnMapEglImage (surf, 0);

    gst_buffer_unmap(buffer, &map);

    return GST_PAD_PROBE_OK;
}

bool run_capture() {
    // Simplified pipeline - convert to BGR in GStreamer pipeline
    std::string pipeline_str =
        "nvarguscamerasrc sensor-id=0 ! "
        "video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1 ! "
        "nvvidconv name=conv0 ! "
        "video/x-raw(memory:NVMM), width=(int)512, height=(int)320, format=(string)RGBA ! "
        "appsink max-buffers=1 drop=true emit-signals=true sync=false";

    pipeline_str +=
        " nvarguscamerasrc sensor-id=1 ! "
        "video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1 ! "
        "nvvidconv name=conv1 ! "
        "video/x-raw(memory:NVMM), width=(int)512, height=(int)320, format=(string)RGBA ! "
        "appsink max-buffers=1 drop=true emit-signals=true sync=false";

    pipeline_str +=
        " nvarguscamerasrc sensor-id=2 ! "
        "video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1 ! "
        "nvvidconv name=conv2 ! "
        "video/x-raw(memory:NVMM), width=(int)512, height=(int)320, format=(string)RGBA ! "
        "appsink max-buffers=1 drop=true emit-signals=true sync=false";

    pipeline_str +=
        " nvarguscamerasrc sensor-id=3 ! "
        "video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1 ! "
        "nvvidconv name=conv3 ! "
        "video/x-raw(memory:NVMM), width=(int)512, height=(int)320, format=(string)RGBA ! "
        "appsink max-buffers=1 drop=true emit-signals=true sync=false";

    /* init GStreamer */
    gst_init (NULL, NULL);
    loop = g_main_loop_new (NULL, FALSE);

    pipeline = gst_parse_launch(pipeline_str.c_str(), NULL);
    //pipeline = gst_parse_launch(argv[1], NULL);
    if (pipeline == NULL)
        g_error ("Failed to launch pipeline");

    /* Try to find element with name=conv */
    GstElement *conv0 = gst_bin_get_by_name(GST_BIN(pipeline), "conv0");
    if (conv0 == NULL)
        g_error ("Failed to find conv in pipeline 0");

    GstElement *conv1 = gst_bin_get_by_name(GST_BIN(pipeline), "conv1");
    if (conv1 == NULL)
        g_error ("Failed to find conv in pipeline 1");

    GstElement *conv2 = gst_bin_get_by_name(GST_BIN(pipeline), "conv2");
    if (conv2 == NULL)
        g_error ("Failed to find conv in pipeline 2");

    GstElement *conv3 = gst_bin_get_by_name(GST_BIN(pipeline), "conv3");
    if (conv3 == NULL)
        g_error ("Failed to find conv in pipeline 3");

    size_t final_frame_size = (512 * 2) * (320 * 2) * 4;
    cudaError_t err = cudaHostAlloc((void**)&unified_final_buffer, final_frame_size, cudaHostAllocMapped);
    if (err != cudaSuccess) err = cudaMallocManaged(&unified_final_buffer, final_frame_size);
    if (err != cudaSuccess) {
        std::cerr << "Failed to allocate memory for final frame: " << cudaGetErrorString(err) << std::endl;
        return false;
    }

    gpu_final_frame = cv::cuda::GpuMat(640, 1024, CV_8UC4, unified_final_buffer);
    std::cout << "gpu_final_frame allocated " << std::endl;

    output = videoOutput::Create("webrtc://@:8554/output");
    if (!output) {
        LogError("failed to create output stream\n");
        cudaFree(unified_final_buffer);
        return false;
    }

    /* Get its src pad and add the probe to it */
    GstPad *pad0 = gst_element_get_static_pad (conv0, "src");
    gst_pad_add_probe (pad0, GST_PAD_PROBE_TYPE_BUFFER,(GstPadProbeCallback) conv_src_pad_buffer_probe, &id0, NULL);
    gst_object_unref (pad0);
    gst_object_unref (conv0);

    GstPad *pad1 = gst_element_get_static_pad (conv1, "src");
    /* gulong probe_id = */ gst_pad_add_probe (pad1, GST_PAD_PROBE_TYPE_BUFFER,(GstPadProbeCallback) conv_src_pad_buffer_probe, &id1, NULL);
    gst_object_unref (pad1);
    gst_object_unref (conv1);

    GstPad *pad2 = gst_element_get_static_pad (conv2, "src");
    /* gulong probe_id = */ gst_pad_add_probe (pad2, GST_PAD_PROBE_TYPE_BUFFER,(GstPadProbeCallback) conv_src_pad_buffer_probe, &id2, NULL);
    gst_object_unref (pad2);
    gst_object_unref (conv2);

    GstPad *pad3 = gst_element_get_static_pad (conv3, "src");
    /* gulong probe_id = */ gst_pad_add_probe (pad3, GST_PAD_PROBE_TYPE_BUFFER,(GstPadProbeCallback) conv_src_pad_buffer_probe, &id3, NULL);
    gst_object_unref (pad3);
    gst_object_unref (conv3);

    /* Try running the pipeline and wait until it's up and running or failed */
    gst_element_set_state (pipeline, GST_STATE_PLAYING);
    if (gst_element_get_state (pipeline, NULL, NULL, -1) == GST_STATE_CHANGE_FAILURE) {
       g_error ("Failed to go into PLAYING state");
    }

    g_main_loop_run (loop);

    //exit and clean up
    gst_element_set_state (pipeline, GST_STATE_NULL);
    gst_object_unref (pipeline);
    cudaFree(unified_final_buffer);
    SAFE_DELETE(output);

    return true;
}


int main(gint argc, gchar *argv[])
{

    run_capture();

    return 0;
}

Hi,

Could you try doing the cuGraphicsEGL-related functions only at init and destroy?
In between, use context synchronization to see if it helps?

Thanks.

I’ve been having difficulty implementing the cuGraphicsEGL-related functions in the init and destroy phases. It seems that when I try to cache the EGLImages I run into issues with cuGraphicsMapResources. I checked, and each Argus camera appears to be cycling through 3-4 buffer descriptors (I guess Argus has a small pool of NVMM surfaces?). I tried creating a new EGLImage for each buffer-and-camera combination (say 12 in total) and then using the appropriate one based on the EGLImage, but this also fails at the cuGraphicsResourceGetMappedEglFrame step. (Note: the first frame works; it’s the second frame on each camera that fails.)

So I think I need a little more help in understanding whether I can avoid cuGraphicsEGLRegisterImage() on every frame, and if so, what I need to do in order to properly cache these resources for re-use. Perhaps there is some context switching I need to manage better.

Okay, so I figured out why it was failing when creating an EGLImage cache for each camera/buffer. I needed to call cudaFree(0) once in each thread (each pad probe callback) before doing any CUDA access (see InitCudaContextOnce() in the code below). Correcting this mistake, the code runs and only registers once per buffer per camera (in my case it looks like 4 buffers per camera, so I end up with 16 sets of CUgraphicsResource).

Profiling, I am still a little disappointed in CPU usage; it has gotten a little better, but for some reason I hoped for more. Perhaps switching from the pad probe to an appsink callback itself might help? I was reading a bit about how the extra overhead of thread management in the pad probe callback might hurt CPU usage… Unsure, but hoping for some direction on where else I could look to gain some performance. Perhaps others have taken a different route to reading in frames that’s more efficient?

For those who are curious, here is my slightly cleaned up (though still timer-heavy) code:

#include <gst/gst.h>
#include <nvbufsurface.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cudaEGL.h>
#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>

#include <jetson-utils/videoOutput.h>
#include <jetson-utils/cudaMappedMemory.h>
#include <jetson-utils/logging.h>

#include <vector>
#include <iostream>
#include <chrono>
#include <unordered_map>
#include <mutex>

// --- Global state ---
videoOutput* output = nullptr;
GMainLoop* loop = nullptr;
GstElement* pipeline = nullptr;
int sensor_ids[4] = {0, 1, 2, 3};

uchar *unified_final_buffer = nullptr;
cv::cuda::GpuMat gpu_final_frame;
cudaStream_t copy_streams[4];  // one stream per sensor (used by the async-copy variant mentioned later)

auto loop_start = std::chrono::high_resolution_clock::now();

// Cache of persistent EGLImage -> CUgraphicsResource registrations; guarded
// by a mutex since each camera's pad probe runs on its own streaming thread
std::unordered_map<EGLImageKHR, CUgraphicsResource> egl_to_cu_resource;
std::mutex egl_map_mutex;

void InitCudaContextOnce()
{
    // Create a thread-local flag to ensure we only init once per thread
    thread_local bool initialized = false;
    if (!initialized) {
        cudaFree(0);  // Forces primary context setup
        initialized = true;
    }
}

void CvCudaProcessEGLImage(EGLImageKHR egl_image, unsigned int pitch, int sensorID)
{
    using namespace std::chrono;

    auto t_start = high_resolution_clock::now();

    CUresult status;
    CUeglFrame eglFrame;
    CUgraphicsResource cu_res = nullptr;

    auto t_reg_start = high_resolution_clock::now();
    // Register only once per EGLImage, then reuse the cached resource.
    // The cache lookup/insert is locked because each camera's probe runs
    // on its own streaming thread.
    {
        std::lock_guard<std::mutex> lock(egl_map_mutex);
        auto it = egl_to_cu_resource.find(egl_image);
        if (it == egl_to_cu_resource.end()) {

            if (egl_image == EGL_NO_IMAGE_KHR) {
                g_printerr("Invalid EGLImage\n");
                return;
            }

            status = cuGraphicsEGLRegisterImage(&cu_res, egl_image, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
            if (status != CUDA_SUCCESS) {
                const char* err_str = nullptr;
                cuGetErrorString(status, &err_str);
                g_printerr("cuGraphicsEGLRegisterImage failed: %s\n", err_str);
                return;
            }
            egl_to_cu_resource[egl_image] = cu_res;
        } else {
            cu_res = it->second;
        }
    }
    auto t_reg_end = high_resolution_clock::now();

    auto t_map_start = high_resolution_clock::now();
    status = cuGraphicsResourceGetMappedEglFrame(&eglFrame, cu_res, 0, 0);
    if (status != CUDA_SUCCESS) {
        g_printerr("cuGraphicsResourceGetMappedEglFrame failed\n");
        return;
    }
    status = cuCtxSynchronize();  // Can be optimized later
    auto t_map_end = high_resolution_clock::now();

    auto t_copy_start = high_resolution_clock::now();
    // pitch is in bytes; 4 bytes per RGBA pixel, so pitch/4 is the width in pixels
    cv::cuda::GpuMat d_mat(eglFrame.height, pitch / 4, CV_8UC4, (uchar*) eglFrame.frame.pPitch[0]);

    const int w = 512, h = 320;
    if (sensorID == 0) d_mat.copyTo(gpu_final_frame(cv::Rect(0, 0, w, h)));
    else if (sensorID == 1) d_mat.copyTo(gpu_final_frame(cv::Rect(w, 0, w, h)));
    else if (sensorID == 2) d_mat.copyTo(gpu_final_frame(cv::Rect(0, h, w, h)));
    else if (sensorID == 3) d_mat.copyTo(gpu_final_frame(cv::Rect(w, h, w, h)));
    auto t_copy_end = high_resolution_clock::now();

    auto t_render_start = high_resolution_clock::now();
    if (sensorID == 0 && output) {
        output->Render((uchar4*)unified_final_buffer, 1024, 640);
        auto end_time = high_resolution_clock::now();
        float fps = 1000.0f / duration_cast<milliseconds>(end_time - loop_start).count();
        //std::cout << "FPS: " << fps << std::endl;
        loop_start = end_time;
    }
    auto t_render_end = high_resolution_clock::now();

    // Total time
    auto t_end = high_resolution_clock::now();

    // Print profile results only for sensor 0 (or all if you prefer)
    /*if (sensorID == 0) {
        auto dur = [](auto a, auto b) { return duration_cast<microseconds>(b - a).count(); };

        std::cout << "[Profile] Sensor " << sensorID
                  << " | Register: " << dur(t_reg_start, t_reg_end) << "us"
                  << " | Map+Sync: " << dur(t_map_start, t_map_end) << "us"
                  << " | Copy: "     << dur(t_copy_start, t_copy_end) << "us"
                  << " | Render: "   << dur(t_render_start, t_render_end) << "us"
                  << " | Total: "    << dur(t_start, t_end) << "us"
                  << std::endl;
    }*/
}

static GstPadProbeReturn conv_src_pad_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
    using namespace std::chrono;
    auto t_probe_start = high_resolution_clock::now();

    InitCudaContextOnce();  // ensures once-per-thread context init

    int sensorID = *(int*)u_data;
    GstBuffer *buffer = (GstBuffer *)info->data;
    GstMapInfo map = {0};

    auto t_map_start = high_resolution_clock::now();
    if (!gst_buffer_map(buffer, &map, GST_MAP_WRITE)) {
        g_printerr("Failed to map GstBuffer\n");
        return GST_PAD_PROBE_OK;
    }
    auto t_map_end = high_resolution_clock::now();

    NvBufSurface* surf = (NvBufSurface*)map.data;
    NvBufSurfaceParams& params = surf->surfaceList[0];

    auto t_eglmap_start = high_resolution_clock::now();
    if (NvBufSurfaceMapEglImage(surf, 0) != 0) {
        g_printerr("Failed to map surface to EGLImage\n");
        gst_buffer_unmap(buffer, &map);
        return GST_PAD_PROBE_OK;
    }
    auto t_eglmap_end = high_resolution_clock::now();

    EGLImageKHR egl_image = params.mappedAddr.eglImage;
    if (!egl_image) {
        g_printerr("No EGLImage mapped from surface\n");
        gst_buffer_unmap(buffer, &map);
        return GST_PAD_PROBE_OK;
    }

    auto t_cuda_start = high_resolution_clock::now();
    CvCudaProcessEGLImage(egl_image, params.pitch, sensorID);
    auto t_cuda_end = high_resolution_clock::now();

    auto t_eglunmap_start = high_resolution_clock::now();
    NvBufSurfaceUnMapEglImage(surf, 0);
    auto t_eglunmap_end = high_resolution_clock::now();

    gst_buffer_unmap(buffer, &map);
    auto t_probe_end = high_resolution_clock::now();

    /*if (sensorID == 0) {
        auto us = [](auto a, auto b) { return duration_cast<microseconds>(b - a).count(); };

        std::cout << "[Probe Profile] Sensor " << sensorID
                  << " | gst_map: "     << us(t_map_start, t_map_end) << "us"
                  << " | EGLMap: "      << us(t_eglmap_start, t_eglmap_end) << "us"
                  << " | CUDA proc: "   << us(t_cuda_start, t_cuda_end) << "us"
                  << " | EGLUnmap: "    << us(t_eglunmap_start, t_eglunmap_end) << "us"
                  << " | Total Probe: " << us(t_probe_start, t_probe_end) << "us"
                  << std::endl;
    }*/

    return GST_PAD_PROBE_OK;
}

bool run_capture()
{
    InitCudaContextOnce();
    gst_init(NULL, NULL);
    loop = g_main_loop_new(NULL, FALSE);

    std::string pipeline_str;
    for (int i = 0; i < 4; ++i) {
        pipeline_str += "nvarguscamerasrc sensor-id=" + std::to_string(i) +
                        " ! video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1" +
                        " ! nvvidconv name=conv" + std::to_string(i) +
                        " ! video/x-raw(memory:NVMM),width=512,height=320,format=RGBA" +
                        " ! appsink max-buffers=1 drop=true emit-signals=true sync=false ";
    }

    pipeline = gst_parse_launch(pipeline_str.c_str(), nullptr);
    if (!pipeline) {
        g_printerr("Failed to create pipeline\n");
        return false;
    }

    for (int i = 0; i < 4; ++i) {
        GstElement* conv = gst_bin_get_by_name(GST_BIN(pipeline), ("conv" + std::to_string(i)).c_str());
        GstPad* pad = gst_element_get_static_pad(conv, "src");
        gst_pad_add_probe(pad, GST_PAD_PROBE_TYPE_BUFFER, conv_src_pad_buffer_probe, &sensor_ids[i], nullptr);
        gst_object_unref(pad);
        gst_object_unref(conv);
    }

    size_t final_frame_size = 1024 * 640 * 4;
    cudaError_t err = cudaHostAlloc((void**)&unified_final_buffer, final_frame_size, cudaHostAllocMapped);
    if (err != cudaSuccess)
        err = cudaMallocManaged(&unified_final_buffer, final_frame_size);

    if (err != cudaSuccess) {
        std::cerr << "Failed to allocate final frame buffer: " << cudaGetErrorString(err) << std::endl;
        return false;
    }

    gpu_final_frame = cv::cuda::GpuMat(640, 1024, CV_8UC4, unified_final_buffer);
    std::cout << "gpu_final_frame allocated\n";

    for (int i = 0; i < 4; ++i) {
        cudaStreamCreate(&copy_streams[i]);
    }

    output = videoOutput::Create("webrtc://@:8554/output");
    if (!output) {
        LogError("failed to create output stream\n");
        cudaFree(unified_final_buffer);
        return false;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    if (gst_element_get_state(pipeline, nullptr, nullptr, -1) == GST_STATE_CHANGE_FAILURE) {
        g_error("Failed to go into PLAYING state");
    }

    g_main_loop_run(loop);

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);

    for (auto& kv : egl_to_cu_resource) {
        cuGraphicsUnregisterResource(kv.second);
    }
    egl_to_cu_resource.clear();

    cudaFree(unified_final_buffer);
    SAFE_DELETE(output);

    return true;
}

int main(int argc, char* argv[])
{
    return run_capture() ? 0 : 1;
}

Hi,

Could you also share the elapsed time of each step in your testing?

It looks like there are some d_mat.copyTo calls in the source.
Are these required due to ROI usage?

Thanks.

So for the code I shared, here are the timings:

[Profile] Sensor 0 | Register: 5us | Map+Sync: 38us | Copy: 1305us | Render: 1009us | Total: 2361us
[Probe Profile] Sensor 0 | gst_map: 2us | EGLMap: 316us | CUDA proc: 3009us | EGLUnmap: 113us | Total Probe: 3446us
[Profile] Sensor 0 | Register: 5us | Map+Sync: 38us | Copy: 1312us | Render: 1016us | Total: 2373us
[Probe Profile] Sensor 0 | gst_map: 1us | EGLMap: 295us | CUDA proc: 3036us | EGLUnmap: 211us | Total Probe: 3549us
[Profile] Sensor 0 | Register: 5us | Map+Sync: 35us | Copy: 1351us | Render: 1006us | Total: 2400us
[Probe Profile] Sensor 0 | gst_map: 2us | EGLMap: 332us | CUDA proc: 3058us | EGLUnmap: 554us | Total Probe: 3952us

I do have a different version where I use asynchronous CUDA copies (via streams) to copy the images into my final frame, which reduces the Copy time to sub-100 us. But this isn’t the final state of the code; other operations will occur, so I’m not overly interested in optimizing this portion. I mainly want to ensure I’m getting data to a cv::cuda::GpuMat as efficiently as possible (the step before I start merging everything); what happens after that will change with my application specifics. For now I just have this combining of 4 frames to send out so I can verify all streams are working.
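For reference, that async-copy variant is essentially the following (a sketch; cv::cuda::StreamAccessor::wrapStream wraps the raw cudaStream_t handles I already create in copy_streams):

#include <opencv2/core/cuda.hpp>
#include <opencv2/core/cuda_stream_accessor.hpp>
#include <cuda_runtime.h>

// The ROI copy from CvCudaProcessEGLImage(), made asynchronous on the
// per-sensor stream created in run_capture(). Called as e.g.:
//   async_tile_copy(d_mat, gpu_final_frame, copy_streams[sensorID], cv::Rect(0, 0, 512, 320));
void async_tile_copy(const cv::cuda::GpuMat& d_mat, cv::cuda::GpuMat& gpu_final_frame,
                     cudaStream_t raw_stream, const cv::Rect& roi)
{
    // Wrap the raw cudaStream_t so OpenCV can enqueue work on it
    cv::cuda::Stream cv_stream = cv::cuda::StreamAccessor::wrapStream(raw_stream);

    // Enqueue the copy without blocking the probe thread...
    cv::cuda::GpuMat tile = gpu_final_frame(roi);
    d_mat.copyTo(tile, cv_stream);

    // ...and synchronize only before the mapped EGL frame is released
    cv_stream.waitForCompletion();
}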

I think this is a fairly efficient pipeline. If we ignore the Copy and Render costs, the bulk of the time appears to be spent in EGLMap and EGLUnmap, which together only consume ~0.9 ms per camera, which isn’t too bad. I am more wondering about how I am probing GStreamer with the pad probe, and whether there is a better way to set up my pipeline. Perhaps accessing frames at the appsink instead, as sketched below? (Some reading I was doing suggested there would be less overhead juggling threads at this level, but it wasn’t clear whether this was true or I was misinterpreting.) Or some other approach I should look at?
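To make the appsink idea concrete, the variant I have in mind would look roughly like this (a sketch only; it assumes the gst-app library, appsinks given names like sink0 in the pipeline string, and on_new_sample is a name of my own choosing):

#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <nvbufsurface.h>

// Callback for the appsink "new-sample" signal; the NvBufSurface/EGL handling
// inside would be the same as in my pad-probe version.
static GstFlowReturn on_new_sample(GstAppSink* sink, gpointer u_data)
{
    int sensorID = *(int*)u_data;
    GstSample* sample = gst_app_sink_pull_sample(sink);
    if (!sample)
        return GST_FLOW_ERROR;

    GstBuffer* buffer = gst_sample_get_buffer(sample);
    GstMapInfo map = {0};
    if (gst_buffer_map(buffer, &map, GST_MAP_READ)) {
        NvBufSurface* surf = (NvBufSurface*)map.data;
        // ... same NvBufSurfaceMapEglImage / CUDA path as in the pad probe ...
        gst_buffer_unmap(buffer, &map);
    }
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}

// Wiring it up (emit-signals=true is already set on my appsinks; the sinks
// would need name=sink0 etc. added to the pipeline string):
//   GstElement* sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink0");
//   g_signal_connect(sink, "new-sample", G_CALLBACK(on_new_sample), &sensor_ids[0]);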

Hi,

Based on your profiling results:

EGLMap: 332us
EGLUnmap: 554us

This looks as expected to us.

Thanks.

Okay, so the bulk of what I was struggling with was resolved by AastaLLL’s first post (the links to Honey_Patouceul’s script), along with AastaLLL’s suggestion of only calling some of the cuGraphicsEGL functions at init/destroy. For how this was done, I put a sample of my code in my other responses (for those who might be interested).
