High CPU usage and low framerates using DeepStream

— Xavier AGX, DS 6.0.1, JetPack 4.6.1

Hello everyone,
I’m experiencing high CPU usage and low framerates when processing images from multiple cameras using DeepStream 6.0.1.
The pipeline structure involves the following:
nvvideoconvert nvbuf-memory-type=4 (surface array memory) ! video/x-raw(memory:NVMM), width=1920, height=1080, framerate=35/1, format=BGRx ! appsink sync=false drop=true max-buffers=1 emit-signals=true

In the new-sample callback, the NVMM buffer is mapped and accessed as an NvBufSurface like this:

static GstFlowReturn on_new_image(GstElement* sink, void* user_data)
{
    if(nullptr != user_data)
    {
        CImageProcessingCamera* receivingCamera = reinterpret_cast<CImageProcessingCamera*>(user_data);

        GstSample* sample = NULL;
        g_signal_emit_by_name(sink, "pull-sample", &sample, NULL);

        if(NULL != sample)
        {
            GstBuffer* buffer = gst_sample_get_buffer(sample);

            GstMapInfo info;
            if(gst_buffer_map(buffer, &info, GST_MAP_READ))
            {
                // For NVMM buffers, info.data points at the NvBufSurface
                NvBufSurface* surface = (NvBufSurface*)info.data;
                NvBufSurfaceMap(surface, 0, 0, NVBUF_MAP_READ);
                NvBufSurfaceSyncForDevice(surface, 0, 0);

                // Copy the frame into the camera's shared (zero-copy) image buffer
                std::unique_lock<std::shared_mutex> lockOnCameraFrame(receivingCamera->mGpuFrame.mMutex);
                cudaMemcpy(receivingCamera->mGpuFrame.mGpuImage.data,
                           surface->surfaceList[0].mappedAddr.addr[0],
                           receivingCamera->mImageMemorySize,
                           cudaMemcpyKind::cudaMemcpyDefault);
                lockOnCameraFrame.unlock();

                NvBufSurfaceUnMap(surface, 0, 0);
                gst_buffer_unmap(buffer, &info);
            }
            gst_sample_unref(sample);   // unref the sample even if mapping failed
            ...
        }
        ....
    }
    ...
    return GST_FLOW_OK;
}

When I run the pipeline without the callback, the CPU load is very low. The high CPU load seems to occur as soon as we grab the sample in the callback, and the other contributing factor seems to be cudaMemcpy.
cudaMemcpy copies the image data into a memory space shared between the CPU and GPU. Each image is then immediately processed further. I'd prefer to keep copying into this shared memory.
So far, running this setup with four cameras results in most cores being almost fully utilized.
Is there something I’m missing? Is there a way to reduce the CPU usage? Any help would be appreciated. :)

What is this “memory space shared between the CPU and GPU”?

Hello,
it is a pinned, mapped (zero-copy) memory buffer. I set it up this way to avoid additional CPU-to-GPU memory transfers, since the next processing steps are all done on the GPU.
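Roughly like this (a minimal sketch of how that buffer is allocated; imageMemorySize stands in for mImageMemorySize from the snippet above, and error handling is omitted):

    // Pinned, mapped (zero-copy) allocation: the same physical memory is
    // visible to both the CPU and the GPU, so no extra host-to-device copy is needed.
    void* hostPtr = nullptr;
    cudaHostAlloc(&hostPtr, imageMemorySize, cudaHostAllocMapped);

    // Device-side alias of the same allocation, used by the GPU processing steps.
    void* devicePtr = nullptr;
    cudaHostGetDevicePointer(&devicePtr, hostPtr, 0);

mGpuFrame.mGpuImage.data points into an allocation of this kind, which is why the cudaMemcpy with cudaMemcpyDefault can resolve the destination directly.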

You can first comment out the “cudaMemcpy” line to check whether the CPU load is still high. Theoretically, a GPU-to-GPU copy takes no CPU workload. We don’t know what else you are doing inside the callback; you need to check the steps one by one to find out which processing consumes so much CPU.

I commented out cudaMemcpy, but the CPU usage did not change significantly. However, the load decreased by about 95% when I removed the callback altogether while keeping the four camera pipelines active. Additionally, when I use the callback with just these three lines (pulling the sample and unreferencing it), the CPU load average as indicated by htop increases from 0.2 to about 5:

            GstSample* sample = NULL;
            g_signal_emit_by_name(sink, "pull-sample", &sample, NULL);
            gst_sample_unref(sample);

Do you think combining the separate camera pipelines into a single pipeline using the DeepStream plugin nvstreammux could reduce CPU usage? If I understand correctly, only one sample would then be pulled every ~28.6 ms (one frame period at 35 fps), instead of four. Something like the sketch below is what I have in mind.
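A sketch only (the camera source elements are placeholders, and as far as I know nvstreammux expects NV12 or RGBA NVMM input rather than BGRx):

    <camera source 0> ! nvvideoconvert ! video/x-raw(memory:NVMM), format=RGBA ! mux.sink_0
    ... (likewise for cameras 1–3 into mux.sink_1 … mux.sink_3) ...
    nvstreammux name=mux batch-size=4 width=1920 height=1080 live-source=1 ! appsink sync=false drop=true max-buffers=1 emit-signals=true

The appsink callback would then pull a single batched NvBufSurface whose surfaceList holds one frame per camera.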

Do you have several cameras and each camera is handled by an independent pipeline? What is your complete use case?

Yes, there is currently one pipeline for each camera. As for the use case, I am developing a vision system for my car.

We are not sure. The root cause of the heavy increase in CPU load is not clear yet.