Very long pthread_mutex_lock in GStreamer nvosd

For context:

I have a video pipeline using an accelerated GStreamer pipeline.

The display side of the pipeline has nvstreammux (used to inject detections; I’m not using nvinfer) and nvosd (used to render the detections). The pipeline runs OK.

Performance in terms of FPS is very irregular, and I have profiled it with nsys.
What nsys highlights is that during most of the OSD execution a pthread_mutex_lock blocks execution for a long time, between 10 ms and 20 ms.
It happens even if there are no detections.
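
The detections are injected in a pad probe on the nvstreammux src pad (see the probe log below). A simplified sketch of that pattern, assuming the standard DeepStream metadata API (NvDsBatchMeta/NvDsObjectMeta), is shown here; it is not my exact code, and the class id, confidence and box values are placeholders:

/* Simplified, hypothetical sketch of the detection-injection probe on the
 * nvstreammux src pad; the values written into the object meta are dummies. */
#include <gst/gst.h>
#include "gstnvdsmeta.h"

static GstPadProbeReturn inject_detections_probe(GstPad *pad,
                                                 GstPadProbeInfo *info,
                                                 gpointer user_data)
{
    (void)pad;
    (void)user_data;

    GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER(info);
    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(buf);
    if (!batch_meta)
        return GST_PAD_PROBE_OK;

    for (NvDsMetaList *l = batch_meta->frame_meta_list; l; l = l->next) {
        NvDsFrameMeta *frame_meta = (NvDsFrameMeta *)l->data;

        /* Acquire an object meta from the pool and fill the bbox that
         * nvdsosd will draw later. */
        NvDsObjectMeta *obj = nvds_acquire_obj_meta_from_pool(batch_meta);
        obj->class_id = 0;
        obj->confidence = 0.9f;
        obj->rect_params.left = 100;
        obj->rect_params.top = 100;
        obj->rect_params.width = 80;
        obj->rect_params.height = 120;
        obj->rect_params.border_width = 2;
        obj->rect_params.border_color.red = 1.0;
        obj->rect_params.border_color.alpha = 1.0;

        nvds_add_obj_meta_to_frame(frame_meta, obj, NULL);
    }
    return GST_PAD_PROBE_OK;
}

/* Attached on the muxer src pad, e.g.:
 *   GstPad *src = gst_element_get_static_pad(streammux, "src");
 *   gst_pad_add_probe(src, GST_PAD_PROBE_TYPE_BUFFER,
 *                     inject_detections_probe, NULL, NULL);
 *   gst_object_unref(src);
 */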

=================================
ByteTrack initialized with: fps=30, track_buffer=30, track_thresh=0.50, high_thresh=0.60, match_thresh=0.80
input source /dev/video0
Creating GStreamer capture for device: /dev/video0
gst_capture_create: device=/dev/video0, size=640x480, fps=30
DEBUG: Pipeline created: 0xaaaae973a200
Creating pipeline: v4l2src device=/dev/video0 ! video/x-raw,format=YUY2,width=640,height=480,framerate=30/1 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA ! tee name=t t. ! queue ! appsink name=sink t. ! queue name=mux_queue nvstreammux name=stream-muxer width=640 height=480 batch-size=1 batched-push-timeout=33333 ! nvdsosd name=osd display-bbox=true display-clock=false process-mode=1 ! nv3dsink sync=false
Configuring appsink...
Starting pipeline...
Data flow probe attached to appsink
Data flow probe attached to mux_queue
Detection injection probe attached to nvstreammux src pad
Data flow and metadata extraction probes attached to OSD sink pad
Successfully linked mux_queue to nvstreammux sink_0
Pipeline started successfully, waiting for frames...
Checking for pipeline messages...
GStreamer capture created successfully
DEBUG: Data flowing through mux_queue, buffer count: 30
DEBUG: Data flowing through osd_sink, buffer count: 60
DEBUG: Data flowing through nvstreammux_src, buffer count: 90
DEBUG: Data flowing through osd_sink, buffer count: 120
DEBUG: Data flowing through mux_queue, buffer count: 150
DEBUG: Data flowing through osd_sink, buffer count: 180
DEBUG: Data flowing through mux_queue, buffer count: 210
DEBUG: Data flowing through osd_sink, buffer count: 240
DEBUG: Data flowing through mux_queue, buffer count: 270
DEBUG: Data flowing through appsink, buffer count: 300
DEBUG: Data flowing through nvstreammux_src, buffer count: 330
DEBUG: Data flowing through nvstreammux_src, buffer count: 360

I’m attaching the profiling file
case2.zip (1.9 MB)

Jetson Orin Nano dev kit
CUDA: 12.6.68
cuDNN: 9.3.0.75
TensorRT: 10.3.0.30

Package: nvidia-jetpack
Source: nvidia-jetpack (6.2.1)
Version: 6.2.1+b38
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 6.2.1+b38), nvidia-jetpack-dev (= 6.2.1+b38)
Homepage: Jetson - Embedded AI Computing Platform | NVIDIA Developer
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_6.2.1+b38_arm64.deb
Size: 29300
SHA256: dd9cb893fbe7f80d2c2348b268f17c8140b18b9dbb674fa8d79facfaa2050c53
SHA1: dc630f213f9afcb6f67c65234df7ad5c019edb9c
MD5sum: 9c8dc61bdab2b816dcc7cd253bcf6482
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context and the question, I’ll provide a comprehensive answer.

Issue Summary

You’re experiencing irregular performance (FPS) in your accelerated GStreamer pipeline on the Jetson Orin Nano dev kit. The pipeline uses nvstreammux for injecting detections (not nvinfer) and nvosd for rendering detections. Profiling with nsys reveals that a pthread mutex lock is blocking execution for a long time (10-20 ms) during OSD execution, even when no detections are present.

Possible Causes

  1. Mutex contention: The pthread mutex lock is likely causing contention between threads, leading to the observed blocking behavior. This could be due to the way your pipeline is structured or the specific components used.
  2. nvstreammux: Although you’re not using nvinfer, nvstreammux might still be contributing to the issue. It’s possible that the component is not optimized for your specific use case or is experiencing some internal contention.
  3. System resource constraints: The Jetson Orin Nano dev kit has limited resources (e.g., CPU, memory, and GPU). If your pipeline is not properly optimized or is consuming excessive resources, it could lead to performance issues.

Comparison with Similar Issues

The issue you’re facing is similar to the one described in the jetson_clocks utility documentation (issue 4391619). Although the specific issue is related to ResNet-10 batch_size=1 GPU-only inferencing, the workaround suggests using the jetson_clocks utility to max out the SOC clocks and speed up execution. This might be worth trying in your case, as it could help alleviate the mutex contention issue.

Recommendations

  1. Optimize pipeline structure: Review your pipeline structure and ensure that it’s optimized for your specific use case. Consider reordering components or using more efficient alternatives.
  2. Tune nvstreammux: Investigate nvstreammux’s configuration options and tune them to better suit your needs. You might need to adjust buffer sizes, queue lengths, or other parameters (a minimal sketch follows after this list).
  3. Use jetson_clocks utility: Try using the jetson_clocks utility to max out the SOC clocks and speed up execution. This might help alleviate the mutex contention issue.
  4. Monitor system resources: Keep an eye on system resource utilization (e.g., CPU, memory, and GPU) to ensure that your pipeline is not consuming excessive resources.
  5. Profile and analyze: Continue profiling your pipeline with nsys and analyze the results to identify potential bottlenecks or areas for optimization.
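
For illustration, here is a minimal, hypothetical sketch of how nvstreammux properties could be adjusted on an already-built pipeline. It assumes the legacy nvstreammux properties (live-source, buffer-pool-size, batched-push-timeout) and the element name "stream-muxer" from your launch string; the values are only examples.

/* Hypothetical sketch: tuning nvstreammux properties on an existing pipeline.
 * The element name "stream-muxer" matches the launch string in the log above;
 * the property values are illustrative only. */
#include <gst/gst.h>

static void tune_streammux(GstElement *pipeline)
{
    GstElement *mux = gst_bin_get_by_name(GST_BIN(pipeline), "stream-muxer");
    if (!mux)
        return;

    g_object_set(G_OBJECT(mux),
                 "live-source", TRUE,           /* input is a live camera */
                 "batch-size", 1,
                 "batched-push-timeout", 33333, /* ~1 frame period at 30 fps, in us */
                 "buffer-pool-size", 8,         /* more surfaces in the muxer pool */
                 NULL);
    gst_object_unref(mux);
}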

Additional Information

The provided profiling file (case2.zip) is not accessible, so I couldn’t analyze it directly. However, based on your description, it seems that the pthread mutex lock is the primary cause of the performance issue.

The Jetson Orin Nano dev kit’s specifications and the installed software packages (e.g., CUDA, cuDNN, and TensorRT) are not directly relevant to the issue at hand. However, ensuring that your pipeline is optimized for the specific hardware and software configuration is crucial for achieving optimal performance.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

  1. Please replace ‘nvvidconv’ with ‘nvvideoconvert’.
  2. Please add “live-source=True” to the nvstreammux plugin. Gst-nvstreammux — DeepStream documentation
  3. Please run the pipeline without the “appsink” branch first to get the real performance data (a sketch applying points 1-3 follows after this list).
  4. Can you tell us why you don’t use nvinfer to do the inferencing while you use “nvstreammux” and “nvdsosd” to do the batching operations? What have you done with the GstBuffer you got from “appsink”?
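
For reference, a rough sketch of what points 1-3 could look like when applied to your launch string, built with gst_parse_launch; the tee/appsink branch and the rest of your application are omitted, and this is an illustration rather than a verified drop-in replacement:

/* Hypothetical sketch of the pipeline with points 1-3 applied: nvvideoconvert
 * instead of nvvidconv, live-source=1 on nvstreammux, and no tee/appsink
 * branch. Element names mirror the launch string in the log above. */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
    gst_init(&argc, &argv);

    GError *err = NULL;
    GstElement *pipeline = gst_parse_launch(
        "v4l2src device=/dev/video0 ! "
        "video/x-raw,format=YUY2,width=640,height=480,framerate=30/1 ! "
        "nvvideoconvert ! video/x-raw(memory:NVMM),format=RGBA ! "
        "stream-muxer.sink_0 "
        "nvstreammux name=stream-muxer width=640 height=480 batch-size=1 "
        "live-source=1 batched-push-timeout=33333 ! "
        "nvdsosd name=osd display-bbox=true process-mode=1 ! "
        "nv3dsink sync=false",
        &err);
    if (!pipeline) {
        g_printerr("Failed to build pipeline: %s\n", err ? err->message : "unknown");
        g_clear_error(&err);
        return 1;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);

    /* Block until error or EOS, then tear down. */
    GstBus *bus = gst_element_get_bus(pipeline);
    GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                                                 GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
    if (msg)
        gst_message_unref(msg);
    gst_object_unref(bus);

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    return 0;
}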

Hi,

Replacing nvvidconv with nvvideoconvert creates a huge performance problem in the whole pipeline.
1 - A simple gst-launch script shows that the performance gain is 0, and it creates a conflict on the HW between CUDA and TensorRT.
Why do you want me to change nvvidconv?
CPU usage is 2.8% in both cases, but the GPU/Orin HW is busy for 4 ms in the nvvideoconvert case. Run the script and you will see.

#!/bin/bash
# Test pipeline to verify camera and display functionality

gst-launch-1.0 v4l2src device=/dev/video0 io-mode=4 ! \
  video/x-raw,format=YUY2,width=1280,height=720,framerate=30/1 ! \
  nvvidconv ! \
  video/x-raw\(memory:NVMM\),format=RGBA ! \
  nv3dsink sync=false

‘nvvidconv’ is not a DeepStream element; please use ‘nvvideoconvert’ with other DeepStream elements. And for “v4l2src”, the dmabuf mode is not compatible with DeepStream elements either; the default is OK.

It is not clear; they are both GStreamer elements. Please use my script, which has nothing to do with DeepStream, profile it with Nsys, and compare nvvidconv and nvvideoconvert.

We do not support running nvvidconv with DeepStream elements. There is no point in comparing the two plugins.

Please tell us the performance data from running the pipeline without the “appsink” branch.

Please find the nsys profiling files for nvvideoconvert and nvvidconv.
Thank you.

nvvidconv-nsys.zip (6.7 MB)
nvideoconvert-nsys.zip (6.8 MB)

I have tried your pipeline and measured the FPS; it is quite stable.

The pipeline is as follows (our USB camera only supports 640x480@25fps):

gst-launch-1.0 --gst-debug=fpsdisplaysink:7 v4l2src device=/dev/video0 ! 'video/x-raw,format=YUY2,width=640,height=480,framerate=25/1' ! nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! stream-muxer.sink_0 nvstreammux name=stream-muxer width=640 height=480 batch-size=1 batched-push-timeout=40000 ! nvdsosd name=osd display-bbox=true display-clock=false process-mode=1 ! fpsdisplaysink sync=false video-sink="nv3dsink" message-forward=TRUE text-overlay=FALSE signal-fps-measurements=TRUE sync=false

The FPS data when run for 12 minutes:

The FPS when run for 31 minutes:

There is no issue with the basic pipeline.

Please run the same pipeline and get the data here.
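
As a side note, the FPS samples that “signal-fps-measurements=TRUE” enables can also be collected from code instead of the debug log, via fpsdisplaysink’s “fps-measurements” signal. A minimal assumed sketch, using a hypothetical element name “fps_sink”:

/* Hypothetical sketch: logging the "fps-measurements" signal emitted by
 * fpsdisplaysink when signal-fps-measurements=TRUE. Assumes the sink element
 * was created with name=fps_sink; not taken from the application above. */
#include <gst/gst.h>

static void on_fps_measurements(GstElement *fpsdisplaysink,
                                gdouble fps, gdouble droprate, gdouble avgfps,
                                gpointer user_data)
{
    (void)fpsdisplaysink;
    (void)user_data;
    g_print("fps=%.2f drop-rate=%.2f avg-fps=%.2f\n", fps, droprate, avgfps);
}

/* After building the pipeline:
 *   GstElement *sink = gst_bin_get_by_name(GST_BIN(pipeline), "fps_sink");
 *   g_signal_connect(sink, "fps-measurements",
 *                    G_CALLBACK(on_fps_measurements), NULL);
 *   gst_object_unref(sink);
 */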

Hi,

I think you are missing the point.

Please take a look at the attached nsys report; it is the identical application, the only change is swapping nvvidconv for nvvideoconvert.

Focus on the span between frame #134 and frame #145; the camera scene is dark, the camera has a cover on it.

report4_nvvideoconvert.zip (14.1 MB)

thks

I also added some screenshots with descriptions and questions.

  1. The original post is talking about the irregular FPS issue.

But we don’t find the FPS issue with the DeepStream pipeline you posted.

  2. The comparison between ‘nvvidconv’ and ‘nvvideoconvert’ has nothing to do with your original issue. The two plugins are designed and implemented in different ways; there is no point in comparing them in this way.

Can you use the pipeline we provided to measure the actual FPS?

Thank you for the feedback, but you are still missing the point. Look at the “big badaboom” file: it is exactly the same app, the only difference is swapping nvvidconv for nvvideoconvert!! Do you acknowledge the difference and the fact that it is not supposed to be like that?

thks

Do you mean that only nvvideoconvert causes the irregular FPS issue while nvvidconv does not?

I don’t know what you are talking about with the Nsight reports. The NVTX marks are not provided by the DeepStream SDK. Which part of the Nsight graph do you think is for “YUV→RGBA”? If you are talking about the duration of “gst_nvvideoconvert_transform()_ctx” in the Nsight report, the duration may fluctuate due to the GPU load at that moment.

Take the following different periods in your Nsight report as an example:

This part shows that when only “gst_nvvideoconvert_transform” is using the GPU, the processing duration is about 4.275 ms.

This part shows that when another heavy GPU-consuming task, “naiveSlice”, is running together with “gst_nvvideoconvert_transform”, the duration is about 23.667 ms.

It seems you are using TensorRT directly to do the inferencing instead of using DeepStream nvinfer, is that right?

Hi,
Yes, I’m using TensorRT to run some inference and sensor fusion.
The 3 screen captures:

1 - nvvidconv:
The screen capture shows exactly the same app with nvvidconv in the pipeline.
It is OK: regular, with no jitter in the pipeline.

2 and 3 are from the same Nsight profiling (attached).

2 - Explains the pipeline: the result of the local inference is merged with other detections and fed to the tracker with a delay of 10 ms, then from the tracker to the OSD.
Q: Why is the OSD waiting for the end of the inference to process the frame? The frames are unmapped and the display pipeline section is independent of the TensorRT inference.

3 - Big badaboom: you can observe that in frame #136 both the TensorRT inference and nvdsosd are waiting for each other.

Thks

What OSD do you mean? “nvdsosd” plugin?

What do you mean by “unmap” the frame? What do you do with the appsink?

Why is the OSD waiting for the end of the inference to process the frame?

What OSD do you mean? “nvdsosd” plugin?

YES

The frames are unmapped and the display pipeline section is independent of the TensorRT inference.

What do you mean by “unmap” the frame? What do you do with the appsink?

YES, NvBufSurface

Do you mean “NvBufSurfaceUnMap”?

NvBufSurfaceMap and NvBufSurfaceUnMap just help you to get the readable memory address of the frame.

Why do you think “NvBufSurfaceUnMap” will decouple your so-called “display pipeline” (the DeepStream pipeline) and the TensorRT inference? We don’t know how you got the NvBufSurface or how you send the mapped frame buffers to the TensorRT code.

It is not reasonable for us to explain your implementation. We don’t know what you have done with your app.

Here is the callback.
Now I’m copying the NVMM buffer to the CPU at a high cost to work around the problem.
Thanks; please review and advise.

/* Includes needed by this callback; GstCapture is the application's own
 * capture-context struct (not shown here). */
#include <string.h>
#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <cuda_runtime.h>
#include "nvbufsurface.h"

static GstFlowReturn on_new_sample_capture(GstAppSink *appsink, gpointer user_data) {
    GstCapture *capture = (GstCapture *)user_data;
    static int frame_count = 0;

    GstSample *sample = gst_app_sink_pull_sample(appsink);
    if (!sample) {
        return GST_FLOW_ERROR;
    }

    GstBuffer *buffer = gst_sample_get_buffer(sample);
    if (!buffer) {
        gst_sample_unref(sample);
        return GST_FLOW_ERROR;
    }

    // Handle NVMM buffer
    GstMapInfo map;
    if (!gst_buffer_map(buffer, &map, GST_MAP_READ)) {
        gst_sample_unref(sample);
        return GST_FLOW_ERROR;
    }

    g_mutex_lock(&capture->frame_mutex);

    // Extract NvBufSurface from NVMM buffer
    NvBufSurface *surface = (NvBufSurface *)map.data;

    if (surface && surface->numFilled > 0) {
        // Map the surface to get CPU access to the frame
        if (NvBufSurfaceMap(surface, 0, 0, NVBUF_MAP_READ) == 0) {
            // Ensure CPU sees fresh data
            NvBufSurfaceSyncForCpu(surface, 0, 0);

            void *src = surface->surfaceList[0].mappedAddr.addr[0];
            size_t required_size = surface->surfaceList[0].height * surface->surfaceList[0].pitch;

            // Allocate host-pinned buffer if needed (simple approach for now)
            if (!capture->nvmm_buffer || capture->bufferSize < required_size) {
                if (capture->nvmm_buffer) {
                    cudaFreeHost(capture->nvmm_buffer);
                    capture->nvmm_buffer = NULL;
                }

                cudaError_t alloc_result = cudaHostAlloc(&capture->nvmm_buffer, required_size, 0);
                if (alloc_result != cudaSuccess) {
                    // Release everything acquired so far before bailing out
                    NvBufSurfaceUnMap(surface, 0, 0);
                    g_mutex_unlock(&capture->frame_mutex);
                    gst_buffer_unmap(buffer, &map);
                    gst_sample_unref(sample);
                    return GST_FLOW_ERROR;
                }
                capture->bufferSize = required_size;
            }

            // SAFE COPY: copy immediately while the mapped buffer is guaranteed valid
            memcpy(capture->nvmm_buffer, src, required_size);

            // Store metadata
            capture->pitch = surface->surfaceList[0].pitch;
            capture->width = surface->surfaceList[0].width;
            capture->height = surface->surfaceList[0].height;

            // Unmap NVMM surface immediately - we no longer need it
            NvBufSurfaceUnMap(surface, 0, 0);

            // Mark frame ready for inference
            capture->frame_ready = TRUE;
            frame_count++;
        }
    }

    g_mutex_unlock(&capture->frame_mutex);

    gst_buffer_unmap(buffer, &map);
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}