DeepStream Pipeline: Optimising GPU Utilisation

Hi,

Requested Information
• Hardware Platform (Jetson / GPU): A3000x | Intended for multiple Jetson targets
• DeepStream Version: 6.2 (nvcr.io/nvidia/deepstream:6.2-triton as the builder image, nvcr.io/nvidia/deepstream:6.2-base as the runtime image)
• JetPack Version (valid for Jetson only):
• TensorRT Version: 8.5.2-1+cuda11.8
• CUDA Version: 11.8 (x86 target)
• NVIDIA GPU Driver Version (valid for GPU only): 535.183.01
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used and other details for reproducing):

Objective
Identify the bottleneck in my DeepStream application that is causing the GPU idle time and, if possible, rectify it, specifically as it relates to GPU utilisation.

Background
I am currently running a DeepStream/GStreamer pipeline inside a Docker container. This is my own customised app, which runs inference and tracking on people in a crowded place. I am trying to optimise the pipeline to increase frame rate, throughput, etc. From my Nsight Systems report there appear to be many periods where the GPU is doing nothing, i.e. the hardware is being under-utilised. This appears to be the case whether I run one stream or ten streams; it does not seem to make a difference.

Nsight Systems: 10 Streams - Multiple Gaps Between Processing

Nsight Systems: 1 Stream - Multiple Gaps Between Processing

I am unable to attach both of the reports as they exceed the 100MB limit. You can download the reports here:
NSys Reports Folder

Setup
I have Nsight Systems installed on my computer, and I mount the nsys folder from /opt/nvidia… into the container.

Inside the running container I run nsys on my DeepStream application, providing the application with either 1 stream or 10 streams.
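For reference, the profiling invocation is roughly as follows; the binary name, its arguments and the output path are placeholders rather than my exact command:

# Run inside the container (binary name, arguments and output path are placeholders)
nsys profile \
    --trace=cuda,nvtx,osrt \
    --output=/output/nsys_report_1stream \
    ./deepstream_app --num-sources 1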

These streams are provided frames from files. The pipeline achieves ~120 fps for one stream, and around ~12 fps per stream for ten streams (not in debug mode).

If I remove tracking from the pipeline it can reach up to 600 fps, so I think it's safe to assume that the I/O relating to frame loading is not a bottleneck. I am using the standard nvtracker and nvinfer modules.

Note
I have followed the guides provided by NVIDIA for increasing batch size and modifying the tracker and inference configurations, and have seen improvements. That tuning is outside the scope of my concern here.

Please find the Nsight Systems reports for 1 stream and 10 streams linked above.

Concern
I am concerned that I may be running tracking on the multiplexed frames (multiplexed frame → inference → tracking). Would it be faster, and enable greater throughput, to give each stream its own inference and tracking steps rather than combining the streams?

Or maybe the idle gaps are only an artefact of the profiler?

Please let me know if there is anything further that I can provide you.

Thank you, I appreciate you taking the time to read my request.

Please provide the complete pipeline and configurations. It is better to provide all details about your pipeline, e.g. the details described in Performance — DeepStream documentation.

Since your question is general, what we can provide is some general guidance: Troubleshooting — DeepStream documentation

Hi Fiona,

Thank you for getting back to me.

Below is a graphic of the full pipeline, generated from GStreamer's debug-output dot file. I will follow up with additional information shortly.

Please remove the nvvideotemplate from the pipeline and replace the udpsink with fakesink (sync=0), then check whether the GPU idle time decreases with these changes.

Methodology

I have run the code 10 separate times, in two batches: five runs for the proposed pipeline and five runs for the original pipeline previously mentioned. The configuration, input streams and code are identical for every run in a batch.

Proposed changes:

remove the nvvideotemplate from the pipeline and replace the udpsink with fakesink(sync=0)
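Purely to illustrate the sink change (a simplified gst-launch sketch with videotestsrc standing in for the real decode/inference chain, not my actual application pipeline):

# udpsink is replaced by a non-synchronising fakesink; videotestsrc is only a stand-in source
gst-launch-1.0 videotestsrc num-buffers=300 ! fakesink sync=0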

Runs

  1. Export GStreamer pipeline PNG.
  2. Provide and extract FPS data for 1 stream (compile optimised).
  3. Provide and extract FPS data for 10 streams (compile optimised).
  4. Run the application through NSYS with 1 stream (debug symbols included & compile optimised).
  5. Run the application through NSYS with 10 streams (debug symbols included & compile optimised).

FPS

The FPS data was extracted by adding a probe at the end of the pipeline, which prints to standard out each time a frame from a stream reaches the end. The logs from the Docker container are then parsed to produce the figures below; a rough sketch of the parsing follows.
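Roughly, the parsing looks like this; the container name and the probe's log prefix are placeholders for my actual setup:

# Count the probe prints in the container log and divide by the measured wall-clock run time.
# "deepstream-1" and "Frame reached sink" are placeholders; per-stream FPS additionally filters on the stream id.
frames=$(docker logs deepstream-1 2>&1 | grep -c "Frame reached sink")
elapsed=60   # wall-clock processing time of the run in seconds, measured separately
awk -v f="$frames" -v t="$elapsed" 'BEGIN { printf "average fps: %.1f\n", f / t }'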

Results

Pipelines

Original Pipeline

Optimised Pipeline

FPS

Pipeline     Num Streams   FPS
Original     1             129
Optimised    1             117
Original     10            16
Optimised    10            16

NSYS

Original 1 Stream

Optimised 1 Stream

Original 10 Streams

Optimised 10 Streams

Conclusion

The adjustments to the pipeline produced no significant performance improvement. These parts of the pipeline will likely need to be added back: in our case the functionality currently provided by nvvideotemplate will probably be relocated and the element removed, while the RTSP sink will be needed in production. The gains have been marginal, and the changes have failed to address the GPU utilisation issue, as evidenced by the NSYS reports.

Discussion

Removing parts from the pipeline yielded no significant improvement. The GPU still appears to be under-utilised, idling roughly half the time.

I suspect the nvtracker element is the bottleneck. I am assuming that all prior buffers are full (inference etc.), but tracking requires sequential processing, likely involving both GPU and CPU. This causes the tracker to run continuously, queuing everything up behind it and stalling the pipeline while it waits for the tracker to complete. This ongoing bottleneck may be preventing full GPU utilisation, since all components except the tracker can process frames in any order.

I’d love to hear your thoughts @Fiona.Chen on this matter :-)

Thank you for taking the time to review my post and for the previous suggestions; I appreciate your responses.

Please use trtexec to measure the performance of your model. I notice you used a batch-size-1 engine in your pipeline.

Please also use the “nvidia-smi dmon” command to monitor the GPU performance when running the optimized pipeline.

I will give this a go, but I have not noticed any improvement when adjusting the batch size of the model. Thank you for your input.
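For reference, I plan to capture the dmon output along these lines (the metric selection, interval and output path are just an example, not a prescribed command):

# Sample GPU utilisation (u) and memory (m) metrics once per second while the pipeline runs
nvidia-smi dmon -s um -d 1 | tee /output/nvidia-smi-dmon.txt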

Please tell us your test video’s properties, such as resolution, codec format (H264, H265, …) and video FPS.

There are 4 videos similar to this one, each ~1500 frames:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'pedestrians_1_1min.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
    comment         : {"dewarpingParams":{"cameraProjection":"equidistant","enabled":false,"fovRot":0,"hStretch":1,"radius":0.5,"sphereAlpha":0,"sphereBeta":0,"viewMode":"ceiling","xCenter":0.5,"yCenter":0.5},"encryptionData":[],"integrityHash":"","metadataStreamVersion":0,"ov
  Duration: 00:01:00.00, start: 0.000000, bitrate: 6896 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080, 6894 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
    Metadata:
      handler_name    : VideoHandler

Stream 1 → pedestrians_1_1min.mp4
Stream 2 → pedestrians_2_1min.mp4
Stream 3 → pedestrians_3_1min.mp4
Stream 4 → pedestrians_4_1min.mp4
Stream 5 → pedestrians_1_1min.mp4

Please also use the “nvidia-smi dmon” command to monitor the GPU performance when running the optimized pipeline.
Please use trtexec to measure the performance of your model.

Please find attached the requested artifacts.

nvidia-smi dmon
1 stream: nvdia-smi.txt (3.5 KB)
2 streams: nvdia-smi.txt (6.6 KB)
5 streams: nvdia-smi.txt (6.3 KB)
10 streams: nvdia-smi.txt (9.9 KB)
12 streams: nvdia-smi.txt (11.8 KB)

nvidia-smi: GPU Details
nvidia-smi-gpu-details.txt (77 Bytes)

Trtexec Files
Code run inside the container to generate the files.

dpkg-query -W | grep nvinfer && /usr/src/tensorrt/bin/trtexec --version | tee /output/trtexec_version.txt

echo "=== GPU Details ==="
nvidia-smi --query-gpu=name,memory.total --format=csv | tee /output/nvidia-smi-gpu-details.txt

for engine_file in /app/deepstream_src/models/*.engine; do
    echo "=== Benchmarking $engine_file ==="
    engine_name=$(basename "$engine_file" .engine)

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 --workspace=1024 --verbose | tee "/output/${engine_name}_trtexec_basic_profile.txt"

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 --dumpProfile --workspace=1024 --verbose | tee "/output/${engine_name}_trtexec_layerwise_profile.txt"

    /usr/src/tensorrt/bin/trtexec --loadEngine="$engine_file" --iterations=100 \
        --exportTimes="/output/${engine_name}_times.json" \
        --exportProfile="/output/${engine_name}_profile.json" \
        --exportLayerInfo="/output/${engine_name}_layer_info.json" \
        --workspace=1024 --verbose

    echo "=== Benchmarking completed for $engine_file ==="
done

head_detector_29_3.onnx_b1_gpu0_fp16_profile.json.txt (24.0 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_layer_info.json.txt (9.8 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_trtexec_layerwise_profile.txt (43.4 KB)
head_detector_29_3.onnx_b1_gpu0_fp16_trtexec_basic_profile.txt (49.4 KB)

Pipeline: 2 streams

Additional:
head_detector_29_3.onnx_b1_gpu0_fp16_work.txt (44.3 KB)

I have recompiled my weights → ONNX → engine to support dynamic batching, with a maximum batch size of 10.
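The rebuild was along these lines; the input binding name "input" and the 3x544x960 shape are placeholders, and the engine file name simply mirrors the naming convention of the attachments above:

# Sketch of rebuilding the engine with a dynamic batch of 1..10 (binding name and shape are placeholders)
/usr/src/tensorrt/bin/trtexec \
    --onnx=/app/deepstream_src/models/head_detector_29_3.onnx \
    --fp16 \
    --minShapes=input:1x3x544x960 \
    --optShapes=input:10x3x544x960 \
    --maxShapes=input:10x3x544x960 \
    --saveEngine=/app/deepstream_src/models/head_detector_29_3.onnx_b10_gpu0_fp16.engine \
    --workspace=1024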

I have not seen any performance or utilization improvements.

To doubly confirm it was running with a larger batch size, I added a probe to the pipeline that prints out the batch size of each processed batch.

    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(buffer);
    if (batch_meta) {
        /* num_frames_in_batch is how many frames nvstreammux packed into this batch */
        g_print("Batch size being processed: %u\n", batch_meta->num_frames_in_batch);
    }

stdout

deepstream-1  | Batch size being processed: 10
deepstream-1  | Batch size being processed: 1
deepstream-1  | Batch size being processed: 10

Here is the NSYS report; it appears no different.

NVIDIA SMI logs:
nvdia-smi.txt (20.3 KB)

Can you provide the complete nsys log file?

@Fiona.Chen Please find the zip file containing the NSYS report. This report was generated by running the “optimised” pipeline with 10 streams, using a batch size of 10, including a buffer probe printing out the number of items in the batch, compiled with debug (-G, etc) and optimisation flags.

nsys_report.zip (89.4 MB)

Thank you @Fiona.Chen I appreciate you taking the time to review my concern. Please let me know if there is anything further I can provide you with.

Can you disable the probe function? It is a blocking callback which may impact the pipeline speed.

Please find the zip file containing the NSYS report. This report was generated by running the “optimised” pipeline with 10 streams, using a batch size of 10, compiled with debug (-G, etc) and optimisation flags.

Full file here:

Additionally, the file broken up into parts is here:
nsys_report.part.001.gz (57.8 MB)
nsys_report.part.000.gz (80 MB)

# Combine
 cat nsys_report.part.*.gz > nsys_report.nsys-rep

# Expected md5sum
md5sum nsys_report.nsys-rep
2e5a788f51d3b1b48ae4e4fd38ea1a3e  nsys_report.nsys-rep

What kind of “optimisation”?

A simpler way to identify the problem is to deploy your model with the deepstream-app sample. Then we will be aligned on the same app.

If your model supports dynamic batch, it is better to set the batch size to the same value as the number of sources in the nvinfer configuration file (not the nvstreammux config). Please provide the ONNX model and nvinfer configuration file. What does the model detect?
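For example, with 10 sources the relevant part of the nvinfer configuration would look like the excerpt below (illustrative only; the file name and the other keys depend on your application):

# excerpt from the nvinfer configuration file
[property]
batch-size=10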

The “optimisation” refers to the following build flags in my Makefile:

# Toggles for the two build variants used in the runs above
COMPILE_OPTIMISED=1
INCLUDE_DEBUG_SYMBOLS=1

ifeq ($(COMPILE_OPTIMISED), 1)
	CFLAGS += -O3
	CUFLAGS += -O3 -DNDEBUG
else
	CFLAGS += -O0 -ffloat-store -fno-fast-math
endif

ifeq ($(INCLUDE_DEBUG_SYMBOLS), 1)
    CFLAGS += -g3 -DDEBUG_FLAG -fno-omit-frame-pointer -gdwarf-4
    CUFLAGS += -g -lineinfo -G
endif
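For completeness, the two build variants used in the runs above are selected by overriding these variables, e.g. on the make command line (the build target itself is omitted here):

# FPS runs: optimised build without debug symbols
make COMPILE_OPTIMISED=1 INCLUDE_DEBUG_SYMBOLS=0

# NSYS runs: optimised build with debug symbols for readable stack traces
make COMPILE_OPTIMISED=1 INCLUDE_DEBUG_SYMBOLS=1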