Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU) A30
• DeepStream Version 8.0
• TensorRT Version 10.0
• Issue: Unable to reach higher FPS with the current engine, even with a higher nvinfer batch size.
Title: How to configure TensorRT/DeepStream batch size to maximize throughput (>1300 FPS target)
Hello,
I am trying to optimize inference throughput for my ONNX model integrated into DeepStream. My goal is to understand how TensorRT engine configuration (min/opt/max batch size, streams) and DeepStream nvinfer batch size relate to achieving higher FPS, ideally above 1300 FPS.
DeepStream nvinfer configuration
Below is my current ds_demux_pgie_config.txt:
```
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
onnx-file=folded.onnx
model-engine-file=model_b100_gpu0_fp16.engine
#int8-calib-file=calib.table
labelfile-path=labels.txt
batch-size=100
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
process-mode=1
network-type=0
cluster-mode=2
maintain-aspect-ratio=1
symmetric-padding=1
workspace-size=16384
#parse-bbox-func-name=NvDsInferParseYolo
parse-bbox-func-name=NvDsInferParseYoloCuda
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet

[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.35
topk=300
```
Batch size in DeepStream is currently set as:
• batch-size=100
• minShapes=1, optShapes=50, maxShapes=100 (during engine build)
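For reference, a dynamic-batch engine matching these min/opt/max values can be built with trtexec roughly as follows. This is a sketch only: the input tensor name `input` and the 3x640x640 dimensions are placeholders, not taken from my actual model; they must be replaced with the real input binding of folded.onnx.

```shell
# Sketch: build an FP16 engine with a dynamic batch dimension.
# "input" and 3x640x640 are assumed placeholder name/dimensions.
trtexec --onnx=folded.onnx \
        --fp16 \
        --minShapes=input:1x3x640x640 \
        --optShapes=input:50x3x640x640 \
        --maxShapes=input:100x3x640x640 \
        --saveEngine=model_b100_gpu0_fp16.engine
```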
TensorRT Performance (trtexec)

Without --streams:
• Throughput: ~774.9 QPS
• Mean latency: ~1.48 ms
• GPU compute time (mean): ~1.28 ms
• Coefficient of variance: ~18.8%

With --streams enabled:
• Throughput: ~1379.2 QPS
• Mean latency: ~6.0 ms (note: with parallel streams, latency is less reliable)
• GPU compute time (mean): ~5.76 ms
• Coefficient of variance: ~12.6%
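For context on how I am interpreting these numbers: my understanding is that trtexec reports throughput in queries per second, where one query is one execution at the profiled batch size, so frames/s = QPS × batch. A quick sketch of that arithmetic (the batch values here are hypothetical, since I am not certain which batch trtexec actually ran with):

```python
# Sketch: convert trtexec throughput (queries/s) into frames/s,
# assuming each query processes one batch of `batch` frames.
def qps_to_fps(qps: float, batch: int) -> float:
    return qps * batch

# Hypothetical examples (trtexec runs batch 1 unless --shapes pins a
# larger batch dimension):
print(qps_to_fps(774.9, 1))     # single stream, batch 1
print(qps_to_fps(1379.2, 1))    # multi-stream, batch 1
```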
My Questions

Given the above QPS results and DeepStream nvinfer settings:

1. How should I set the TensorRT engine batch sizes (min/opt/max) and the DeepStream nvinfer batch-size to maximize GPU utilization and achieve >4000 FPS throughput?
2. How does the trtexec QPS (with vs. without streams) translate into real DeepStream FPS once the engine is deployed?
3. Is there a recommended formula or rule of thumb for choosing batch size so that throughput scales without hitting diminishing returns (e.g., higher latency or inefficient GPU scheduling)?
4. Do multiple inference streams at the TensorRT level map effectively onto DeepStream's frame batching, or should one rely more on a larger DeepStream batch-size instead?
Any guidance or best practices from the NVIDIA team or the community would be very helpful.
Thank you