Hi,
Requested Questions
• Hardware Platform (Jetson / GPU): A3000x | Intended for multiple Jetson targets
• DeepStream Version: 6.2 (nvcr.io/nvidia/deepstream:6.2-triton as builder stage, nvcr.io/nvidia/deepstream:6.2-base as runtime image)
• JetPack Version (valid for Jetson only): N/A (currently profiling on the x86 target)
• TensorRT Version: 8.5.2-1+cuda11.8
• CUDA Version: 11.8 (x86 target)
• NVIDIA GPU Driver Version (valid for GPU only): 535.183.01
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing):
Objective
Identify the bottleneck in my DeepStream application that is causing the GPU idle time, and rectify it if possible. Specifically, I want to improve GPU utilisation.
Background
I am currently running a DeepStream/GStreamer pipeline inside a Docker container. This is my own customised app; it runs inference and tracking on people in a crowded place. I am trying to optimise the pipeline to increase frame rate, throughput, etc. From my Nsight Systems report there appear to be many periods where the GPU is doing nothing, i.e. the hardware is being under-utilised. This appears to be the case whether I run one stream or ten streams; the number of streams does not seem to make a difference.
Nsight Systems: 10 Streams - Multiple Gaps Between Processing
Nsight Systems: 1 Stream - Multiple Gaps Between Processing
I am unable to attach both of the reports as they exceed the 100MB limit. You can download the reports here:
NSys Reports Folder
Setup
I have Nsight Systems installed on my computer, and I mount the nsys folder from my /opt/nvidia… install into the container.
I start the container and then run nsys on my DeepStream application, providing the application with either 1 stream or 10 streams.
These streams are provided frames via file. The pipeline runs at ~120 fps for one stream, and at around ~12 fps per stream for ten streams (not in debug mode).
If I remove tracking from the pipeline it can reach up to 600 fps, so I think it is safe to assume that the I/O relating to frame loading is not the bottleneck. I am using the standard nvtracker and nvinfer elements. A rough sketch of how I measure the frame rate is below.
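For reference, this is roughly how I measure the frame rate: a buffer probe on the src pad of the tracker that counts buffers per second. The element name ("tracker") and probe placement are illustrative placeholders, not my exact code; downstream of nvstreammux each buffer carries a batch, so with batch-size equal to the number of streams this approximates the per-stream frame rate.

```python
import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)


class FpsProbe:
    """Counts buffers passing a pad and prints a rate roughly once per second."""

    def __init__(self):
        self.count = 0
        self.start = time.monotonic()

    def on_buffer(self, pad, info, _user_data):
        self.count += 1
        elapsed = time.monotonic() - self.start
        if elapsed >= 1.0:
            # After nvstreammux each buffer is a batch, so with
            # batch-size == number of streams this is ~per-stream fps.
            print(f"~{self.count / elapsed:.1f} batched buffers/sec")
            self.count = 0
            self.start = time.monotonic()
        return Gst.PadProbeReturn.OK


def attach_fps_probe(pipeline, element_name="tracker"):
    # "tracker" is a placeholder name; in my app this is the nvtracker element.
    probe = FpsProbe()
    src_pad = pipeline.get_by_name(element_name).get_static_pad("src")
    src_pad.add_probe(Gst.PadProbeType.BUFFER, probe.on_buffer, 0)
    return probe
```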
Note
I have followed the guides provided by NVIDIA for increasing batch size and modifying the tracker and inference configuration parameters, and I have seen improvements. That tuning is outside the scope of my concern here.
Please find the Nsight Systems reports for 1 stream and 10 streams at the link above.
Concern
I am concerned that I may be losing throughput because tracking runs on the multiplexed frames (multiplexed frame → inference → tracking). Would it be faster, and enable greater throughput, to give each stream its own inference and tracking steps rather than combining the streams? A simplified sketch of my current topology is shown below.
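Here is a stripped-down sketch of the kind of topology I am describing, built from a gst-launch style description (the file locations, config path, resolution and batch size are placeholders, not my real values). Decoding happens per stream, then nvstreammux batches everything, and nvinfer and nvtracker both operate on the batched buffer:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Current arrangement (simplified): per-stream decode -> nvstreammux batches
# all streams -> nvinfer and nvtracker run once on the batched buffer.
# Paths, resolution and batch size below are placeholders.
PIPELINE_DESC = (
    "nvstreammux name=mux batch-size=2 width=1920 height=1080 "
    "batched-push-timeout=40000 ! "
    "nvinfer config-file-path=/configs/pgie_config.txt batch-size=2 ! "
    "nvtracker ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/"
    "libnvds_nvmultiobjecttracker.so ! "
    "fakesink sync=false "
    "filesrc location=/data/stream0.mp4 ! qtdemux ! h264parse ! "
    "nvv4l2decoder ! mux.sink_0 "
    "filesrc location=/data/stream1.mp4 ! qtdemux ! h264parse ! "
    "nvv4l2decoder ! mux.sink_1"
)

pipeline = Gst.parse_launch(PIPELINE_DESC)
pipeline.set_state(Gst.State.PLAYING)

# Block until EOS or error, then tear down.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```

The alternative I am asking about would keep a separate inference + tracking branch per stream (i.e. before any batching), instead of the single batched nvinfer → nvtracker path shown here.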
Or are the gaps perhaps only an artefact of the profiler?
Please let me know if there is anything further that I can provide you.
Thank you, I appreciate you taking the time to read my request.