DeepStream Pipeline: Optimising GPU Utilisation

Hi,

Requested Information
• Hardware Platform (Jetson / GPU): A3000x | Intended for multiple Jetson targets
• DeepStream Version: 6.2 (nvcr.io/nvidia/deepstream:6.2-triton as the builder image, nvcr.io/nvidia/deepstream:6.2-base as the runtime image)
• JetPack Version (valid for Jetson only):
• TensorRT Version: 8.5.2-1+cuda11.8
• CUDA Version: 11.8 (x86 target)
• NVIDIA GPU Driver Version (valid for GPU only): 535.183.01
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used and other details for reproducing):

Objective
Identify the bottleneck in my DeepStream application that is causing the GPU idle time and, if possible, rectify it, specifically as it relates to GPU utilisation.

Background
I am currently running a DeepStream/GStreamer pipeline inside a Docker container. This is my own customised app, which runs inference and tracking on people in a crowded place. I am trying to optimise the pipeline to increase frame rate, throughput, etc. From my Nsight Systems report there appear to be many periods where the GPU is doing nothing, i.e. the hardware is being under-utilised. This appears to be the case whether I run one stream or ten streams; it does not seem to make a difference.

Nsight Systems: 10 Streams - Multiple Gaps Between Processing

Nsight Systems: 1 Stream - Multiple Gaps Between Processing

I am unable to attach both of the reports as they exceed the 100MB limit. You can download the reports here:
NSys Reports Folder

Setup
I have Nsight Systems installed on my computer, and I mount the nsys folder from /opt/nvidia… into the container.

Inside the running container I run nsys on my DeepStream application, providing the application with either 1 stream or 10 streams.
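For reference, the profiling invocation is roughly as follows; the binary name, its arguments and the output path are placeholders rather than my exact command:

# Run inside the container (binary name, arguments and output path are placeholders)
nsys profile \
    --trace=cuda,nvtx,osrt \
    --output=/output/nsys_report_1stream \
    ./deepstream_app --num-sources 1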

These streams are provided frames from files. The pipeline achieves ~120 fps for one stream, and around ~12 fps per stream for ten streams (not in debug mode).

If I remove tracking from the pipeline it can reach up to 600 fps, so I think it's safe to assume that the I/O relating to frame loading is not a bottleneck. I am using the standard nvtracker and nvinfer modules.

Note
I have followed the guides provided by NVIDIA for increasing batch size and modifying the tracker and inference configurations, and have seen improvements. That tuning is outside the scope of my concern here.

Please find the Nsight Systems reports for 1 stream and 10 streams linked above.

Concern
I am concerned that I may be running tracking on the multiplexed frames (multiplexed frame → inference → tracking). Would it be faster, and enable greater throughput, to give each stream its own inference and tracking steps rather than combining the streams?

Or maybe the idle gaps are only an artefact of the profiler?

Please let me know if there is anything further that I can provide you.

Thank you, I appreciate you taking the time to read my request.

Please provide the complete pipeline and configurations. It is better to provide all details about your pipeline, e.g. the details described in Performance — DeepStream documentation.

Since your question is general, what we can provide is some general guidance: Troubleshooting — DeepStream documentation

Hi Fiona,

Thank you for getting back to me.

Below is a graphic of the full pipeline, generated from GStreamer's debug-output dot file. I will follow up with additional information shortly.

Please remove the nvvideotemplate from the pipeline and replace the udpsink with fakesink (sync=0), then check whether the GPU idle time decreases with these changes.

Methodology

I have run the code 10 separate times, in two batches: five runs for the proposed pipeline and five runs for the original pipeline previously mentioned. The configuration, input streams and code are identical for every run in a batch.

Proposed changes:

remove the nvvideotemplate from the pipeline and replace the udpsink with fakesink(sync=0)
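Purely to illustrate the sink change (a simplified gst-launch sketch with videotestsrc standing in for the real decode/inference chain, not my actual application pipeline):

# udpsink is replaced by a non-synchronising fakesink; videotestsrc is only a stand-in source
gst-launch-1.0 videotestsrc num-buffers=300 ! fakesink sync=0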

Runs

  1. Export GStreamer pipeline PNG.
  2. Provide and extract FPS data for 1 stream (compile optimised).
  3. Provide and extract FPS data for 10 streams (compile optimised).
  4. Run the application through NSYS with 1 stream (debug symbols included & compile optimised).
  5. Run the application through NSYS with 10 streams (debug symbols included & compile optimised).

FPS

The FPS data was extracted by adding a probe at the end of the pipeline, which prints to standard out each time a frame from a stream reaches the end. The logs from the Docker container are then parsed to produce the figures below; a rough sketch of the parsing follows.
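Roughly, the parsing looks like this; the container name and the probe's log prefix are placeholders for my actual setup:

# Count the probe prints in the container log and divide by the measured wall-clock run time.
# "deepstream-1" and "Frame reached sink" are placeholders; per-stream FPS additionally filters on the stream id.
frames=$(docker logs deepstream-1 2>&1 | grep -c "Frame reached sink")
elapsed=60   # wall-clock processing time of the run in seconds, measured separately
awk -v f="$frames" -v t="$elapsed" 'BEGIN { printf "average fps: %.1f\n", f / t }'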

Results

Pipelines

Original Pipeline

Optimised Pipeline

FPS

Pipeline     Num Streams   FPS
Original     1             129
Optimised    1             117
Original     10            16
Optimised    10            16

NSYS

Original 1 Stream

Optimised 1 Stream

Original 10 Streams

Optimised 10 Streams

Conclusion

The adjustments to the pipeline produced no significant performance improvement. These parts of the pipeline will likely need to be added back: in our case the functionality currently provided by nvvideotemplate will probably be relocated and the element removed, while the RTSP sink will be needed in production. The gains have been marginal, and the changes have failed to address the GPU utilisation issue, as evidenced by the NSYS reports.

Discussion

Removing parts from the pipeline yielded no significant improvement. The GPU still appears to be under-utilised, idling roughly half the time.

I suspect the nvtracker element is the bottleneck. I am assuming that all prior buffers are full (inference etc.), but tracking requires sequential processing, likely involving both GPU and CPU. This causes the tracker to run continuously, queuing everything up behind it and stalling the pipeline while it waits for the tracker to complete. This ongoing bottleneck may be preventing full GPU utilisation, since all components except the tracker can process frames in any order.

I’d love to hear your thoughts @Fiona.Chen on this matter :-)

Thank you for taking the time to review my post and for the previous suggestions; I appreciate your responses.

Please use trtexec to measure the performance of your model. I notice you used a batch-size-1 engine in your pipeline.

Please also use the “nvidia-smi dmon” command to monitor the GPU performance when running the optimized pipeline.

I will give this a go, but I have not noticed any improvement when adjusting the batch size of the model. Thank you for your input.
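For reference, I plan to capture the dmon output along these lines (the metric selection, interval and output path are just an example, not a prescribed command):

# Sample GPU utilisation (u) and memory (m) metrics once per second while the pipeline runs
nvidia-smi dmon -s um -d 1 | tee /output/nvidia-smi-dmon.txt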

Please tell us your test video’s properties, such as resolution, codec format (H264, H265, …) and video FPS.

There are 4 videos similar to this one, each ~1500 frames:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'pedestrians_1_1min.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
    comment         : {"dewarpingParams":{"cameraProjection":"equidistant","enabled":false,"fovRot":0,"hStretch":1,"radius":0.5,"sphereAlpha":0,"sphereBeta":0,"viewMode":"ceiling","xCenter":0.5,"yCenter":0.5},"encryptionData":[],"integrityHash":"","metadataStreamVersion":0,"ov
  Duration: 00:01:00.00, start: 0.000000, bitrate: 6896 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080, 6894 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
    Metadata:
      handler_name    : VideoHandler

Stream 1 → pedestrians_1_1min.mp4
Stream 2 → pedestrians_2_1min.mp4
Stream 3 → pedestrians_3_1min.mp4
Stream 4 → pedestrians_4_1min.mp4
Stream 5 → pedestrians_1_1min.mp4

Please also use the “nvidia-smi dmon” command to monitor the GPU performance when running the optimized pipeline.
Please use trtexec to measure the performance of your model.

Please find attached the requested artifacts.

nvidia-smi dmon
1 stream: nvdia-smi.txt (3.5 KB)
2 streams: nvdia-smi.txt (6.6 KB)
5 streams: nvdia-smi.txt (6.3 KB)
10 streams: nvdia-smi.txt (9.9 KB)
12 streams: nvdia-smi.txt (11.8 KB)

nvidia-smi: GPU Details
nvidia-smi-gpu-details.txt (77 Bytes)

Trtexec Files
Code run inside the container to generate the files.

dpkg-query -W | grep nvinfer && /usr/src/tensorrt/bin/trtexec --version | tee /output/trtexec_version.txt

echo "=== GPU Details ==="
nvidia-smi --query-gpu=name,memory.total --format=csv | tee /output/nvidia-smi-gpu-details.txt

for engine_file in /app/deepstream_src/models/*.engine; do
    echo "=== Benchmarking $engine_file ==="
    engine_name=$(basename "$engine_file" .engine)

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 --workspace=1024 --verbose | tee "/output/${engine_name}_trtexec_basic_profile.txt"

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 --dumpProfile --workspace=1024 --verbose | tee "/output/${engine_name}_trtexec_layerwise_profile.txt"

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 \
        --exportTimes="/output/${engine_name}_times.json" \
        --exportProfile="/output/${engine_name}_profile.json" \
        --exportLayerInfo="/output/${engine_name}_layer_info.json" \
        --workspace=1024 --verbose

    echo "=== Benchmarking completed for $engine_file ==="
done

head_detector_29_3.onnx_b1_gpu0_fp16_profile.json.txt (24.0 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_layer_info.json.txt (9.8 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_trtexec_layerwise_profile.txt (43.4 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_trtexec_basic_profile.txt (49.4 KB)

Pipeline: 2 streams

Additional:
head_detector_29_3.onnx_b1_gpu0_fp16_work.txt (44.3 KB)

I have recompiled my weights → ONNX → engine to support dynamic batching, with a maximum batch size of 10.
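The rebuild was along these lines; the input binding name "input" and the 3x544x960 shape are placeholders, and the engine file name simply mirrors the naming convention of the attachments above:

# Sketch of rebuilding the engine with a dynamic batch of 1..10 (binding name and shape are placeholders)
/usr/src/tensorrt/bin/trtexec \
    --onnx=/app/deepstream_src/models/head_detector_29_3.onnx \
    --fp16 \
    --minShapes=input:1x3x544x960 \
    --optShapes=input:10x3x544x960 \
    --maxShapes=input:10x3x544x960 \
    --saveEngine=/app/deepstream_src/models/head_detector_29_3.onnx_b10_gpu0_fp16.engine \
    --workspace=1024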

I have not seen any performance or utilization improvements.

To doubly confirm it was running with a larger batch size, I added a probe to the pipeline that prints out the batch size of each processed batch.

    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(buffer);
    if (batch_meta) {
        /* num_frames_in_batch is how many frames nvstreammux packed into this batch */
        g_print("Batch size being processed: %u\n", batch_meta->num_frames_in_batch);
    }

stdout

deepstream-1  | Batch size being processed: 10
deepstream-1  | Batch size being processed: 1
deepstream-1  | Batch size being processed: 10

Here is the NSYS report; it appears no different.

NVIDIA SMI logs:
nvdia-smi.txt (20.3 KB)

Can you provide the complete nsys log file?

@Fiona.Chen Please find the zip file containing the NSYS report. This report was generated by running the “optimised” pipeline with 10 streams, using a batch size of 10, including a buffer probe printing out the number of items in the batch, compiled with debug (-G, etc) and optimisation flags.

nsys_report.zip (89.4 MB)

Thank you @Fiona.Chen I appreciate you taking the time to review my concern. Please let me know if there is anything further I can provide you with.

Can you disable the probe function? It is a blocking callback which may impact the pipeline speed.

Please find the zip file containing the NSYS report. This report was generated by running the “optimised” pipeline with 10 streams, using a batch size of 10, compiled with debug (-G, etc) and optimisation flags.

Full file here:

Additionally, the file broken up into parts is here:
nsys_report.part.001.gz (57.8 MB)
nsys_report.part.000.gz (80 MB)

# Combine
 cat nsys_report.part.*.gz > nsys_report.nsys-rep

# Expected md5sum
md5sum nsys_report.nsys-rep
2e5a788f51d3b1b48ae4e4fd38ea1a3e  nsys_report.nsys-rep

What kind of “optimisation”?

A simpler way to identify the problem is to deploy your model with the deepstream-app sample. Then we will be aligned on the same app.

If your model supports dynamic batch, it is better to set the batch size to the same value as the number of sources in the nvinfer configuration file (not the nvstreammux config). Please provide the ONNX model and nvinfer configuration file. What does the model detect?
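For example, with 10 sources the relevant part of the nvinfer configuration would look like the excerpt below (illustrative only; the file name and the other keys depend on your application):

# excerpt from the nvinfer configuration file
[property]
batch-size=10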

The “optimisation” refers to the following build flags in my Makefile:

# Toggles for the two build variants used in the runs above
COMPILE_OPTIMISED=1
INCLUDE_DEBUG_SYMBOLS=1

ifeq ($(COMPILE_OPTIMISED), 1)
	CFLAGS += -O3
	CUFLAGS += -O3 -DNDEBUG
else
	CFLAGS += -O0 -ffloat-store -fno-fast-math
endif

ifeq ($(INCLUDE_DEBUG_SYMBOLS), 1)
    CFLAGS += -g3 -DDEBUG_FLAG -fno-omit-frame-pointer -gdwarf-4
    CUFLAGS += -g -lineinfo -G
endif
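For completeness, the two build variants used in the runs above are selected by overriding these variables, e.g. on the make command line (the build target itself is omitted here):

# FPS runs: optimised build without debug symbols
make COMPILE_OPTIMISED=1 INCLUDE_DEBUG_SYMBOLS=0

# NSYS runs: optimised build with debug symbols for readable stack traces
make COMPILE_OPTIMISED=1 INCLUDE_DEBUG_SYMBOLS=1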