High CPU usage and low framerates using DeepStream

— Xavier AGX, DS 6.0.1, JetPack 4.6.1

Hello everyone,
I’m experiencing high CPU usage and low framerates when processing images from multiple cameras using DeepStream 6.0.1.
The pipeline structure involves the following:
nvvideoconvert nvbuf-memory-type=4 (surface array memory) ! video/x-raw(memory:NVMM), width=1920, height=1080, framerate=35/1, format=BGRx ! appsink sync=false drop=true max-buffers=1 emit-signals=true

In the new-sample callback, the NVMM buffer is mapped and accessed as an NvBufSurface like this:

static GstFlowReturn on_new_image(GstElement* sink, void* user_data)
{
    if(nullptr != user_data)
    {
        CImageProcessingCamera* receivingCamera = reinterpret_cast<CImageProcessingCamera*>(user_data);

        GstSample* sample = NULL;
        g_signal_emit_by_name(sink, "pull-sample", &sample, NULL);

        if(NULL != sample)
        {
            GstBuffer* buffer = gst_sample_get_buffer(sample);

            GstMapInfo info;
            if(gst_buffer_map(buffer, &info, GST_MAP_READ))
            {
                // For NVMM buffers, info.data points at the NvBufSurface
                NvBufSurface* surface = (NvBufSurface*)info.data;
                NvBufSurfaceMap(surface, 0, 0, NVBUF_MAP_READ);
                NvBufSurfaceSyncForDevice(surface, 0, 0);

                // Copy the frame into the camera's shared (zero-copy) image buffer
                std::unique_lock<std::shared_mutex> lockOnCameraFrame(receivingCamera->mGpuFrame.mMutex);
                cudaMemcpy(receivingCamera->mGpuFrame.mGpuImage.data,
                           surface->surfaceList[0].mappedAddr.addr[0],
                           receivingCamera->mImageMemorySize,
                           cudaMemcpyKind::cudaMemcpyDefault);
                lockOnCameraFrame.unlock();

                NvBufSurfaceUnMap(surface, 0, 0);
                gst_buffer_unmap(buffer, &info);
            }
            gst_sample_unref(sample);   // unref the sample even if mapping failed
            ...
        }
        ....
    }
    ...
    return GST_FLOW_OK;
}

When I run the pipeline without the callback, the CPU load is very low. The high CPU load seems to occur as soon as we grab the sample in the callback, and the other contributing factor seems to be cudaMemcpy.
cudaMemcpy copies the image data into a memory space shared between the CPU and GPU. Each image is then immediately processed further. I'd prefer to keep copying into this shared memory.
So far, running this setup with four cameras results in most cores being almost fully utilized.
Is there something I’m missing? Is there a way to reduce the CPU usage? Any help would be appreciated. :)

What is this “memory space shared between the CPU and GPU”?

Hello,
it is a pinned, mapped (zero-copy) memory buffer. I set it up this way to avoid additional CPU-to-GPU memory transfers, since the next processing steps are all done on the GPU.
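Roughly like this (a minimal sketch of how that buffer is allocated; imageMemorySize stands in for mImageMemorySize from the snippet above, and error handling is omitted):

    // Pinned, mapped (zero-copy) allocation: the same physical memory is
    // visible to both the CPU and the GPU, so no extra host-to-device copy is needed.
    void* hostPtr = nullptr;
    cudaHostAlloc(&hostPtr, imageMemorySize, cudaHostAllocMapped);

    // Device-side alias of the same allocation, used by the GPU processing steps.
    void* devicePtr = nullptr;
    cudaHostGetDevicePointer(&devicePtr, hostPtr, 0);

mGpuFrame.mGpuImage.data points into an allocation of this kind, which is why the cudaMemcpy with cudaMemcpyDefault can resolve the destination directly.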

You can first comment out the “cudaMemcpy” line to check whether the CPU load is still high. Theoretically, a GPU-to-GPU copy takes no CPU workload. We don’t know what else you are doing inside the callback; you need to check the steps one by one to find out which processing consumes so much CPU.

I commented out cudaMemcpy, but the CPU usage did not change significantly. However, the load decreased by about 95% when I removed the callback altogether while keeping the four camera pipelines active. Additionally, when I use the callback with just these three lines (pulling the sample and unreferencing it), the CPU load average as indicated by htop increases from 0.2 to about 5:

            GstSample* sample = NULL;
            g_signal_emit_by_name(sink, "pull-sample", &sample, NULL);
            gst_sample_unref(sample);

Do you think combining the separate camera pipelines into a single pipeline using the DeepStream plugin nvstreammux could reduce CPU usage? If I understand correctly, only one sample would then be pulled every ~28.6 ms (one frame period at 35 fps), instead of four. Something like the sketch below is what I have in mind.
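A sketch only (the camera source elements are placeholders, and as far as I know nvstreammux expects NV12 or RGBA NVMM input rather than BGRx):

    <camera source 0> ! nvvideoconvert ! video/x-raw(memory:NVMM), format=RGBA ! mux.sink_0
    ... (likewise for cameras 1–3 into mux.sink_1 … mux.sink_3) ...
    nvstreammux name=mux batch-size=4 width=1920 height=1080 live-source=1 ! appsink sync=false drop=true max-buffers=1 emit-signals=true

The appsink callback would then pull a single batched NvBufSurface whose surfaceList holds one frame per camera.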

Do you have several cameras and each camera is handled by an independent pipeline? What is your complete use case?

Yes, there is currently one pipeline for each camera. As for the use case, I am developing a vision system for my car.

We are not sure. The root cause of the heavy increase in CPU load is not clear yet.