Deepstream nvinfer batch size and Tensorrt engine QPS Relationship

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) A30
• DeepStream Version 8.0
• TensorRT Version 10.0
• Issue: Unable to reach higher FPS with the current engine, even with a higher nvinfer batch size.

Title: How to configure TensorRT/DeepStream batch size to maximize throughput (>1300 FPS target)

Hello,

I am trying to optimize inference throughput for my ONNX model integrated into DeepStream. My goal is to understand how TensorRT engine configuration (min/opt/max batch size, streams) and DeepStream nvinfer batch size relate to achieving higher FPS, ideally above 1300 FPS.


DeepStream nvinfer configuration

Below is my current ds_demux_pgie_config.txt:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
onnx-file=folded.onnx
model-engine-file=model_b100_gpu0_fp16.engine
#int8-calib-file=calib.table
labelfile-path=labels.txt
batch-size=100
network-mode=2
num-detected-classes=1
interval=0
gie-unique-id=1
process-mode=1
network-type=0
cluster-mode=2
maintain-aspect-ratio=1
symmetric-padding=1
workspace-size=16384
#parse-bbox-func-name=NvDsInferParseYolo
parse-bbox-func-name=NvDsInferParseYoloCuda
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet

[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.35
topk=300

Batch size in DeepStream is currently set as:

  • batch-size=100

  • minShapes=1, optShapes=50, maxShapes=100 (during engine build)
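For reference, an engine with those dynamic-batch shapes can be built with trtexec roughly as follows. This is a sketch: the input tensor name `input` and the 3x640x640 spatial dimensions are assumptions (typical for a YOLO model) — substitute your model's actual input name and shape.

```shell
# Build an FP16 engine with dynamic batch (min=1, opt=50, max=100).
# "input" and 3x640x640 are placeholders; check your ONNX model's real
# input tensor name and dimensions before running this.
trtexec --onnx=folded.onnx \
        --minShapes=input:1x3x640x640 \
        --optShapes=input:50x3x640x640 \
        --maxShapes=input:100x3x640x640 \
        --fp16 \
        --saveEngine=model_b100_gpu0_fp16.engine
```

Note that TensorRT tunes kernel tactics for optShapes, so it is usually best to set opt to the batch size the pipeline will actually run at, rather than the midpoint.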


TensorRT Performance (trtexec)

Without --streams:

  • Throughput: ~774.9 QPS

  • Mean Latency: ~1.48 ms

  • GPU Compute Time (mean): ~1.28 ms

  • Coefficient of variance: ~18.8%

With --streams enabled:

  • Throughput: ~1379.2 QPS

  • Mean Latency: ~6.0 ms (note: parallel streams, latency less reliable)
  • GPU Compute Time (mean): ~5.76 ms

  • Coefficient of variance: ~12.6%
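For completeness, the two runs above can be reproduced roughly like this (a sketch; the input tensor name and shape values are assumptions, as before):

```shell
# Single-stream baseline at the batch size the engine will run at
trtexec --loadEngine=model_b100_gpu0_fp16.engine \
        --shapes=input:100x3x640x640

# Multiple CUDA streams: raises aggregate QPS at the cost of
# per-inference latency, as the numbers above show
trtexec --loadEngine=model_b100_gpu0_fp16.engine \
        --shapes=input:100x3x640x640 \
        --streams=4
```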


My Question

Given the above QPS results and DeepStream nvinfer settings:

  1. How should I set the TensorRT engine batch sizes (min/opt/max) and the DeepStream nvinfer batch-size to maximize GPU utilization and achieve >1300 FPS throughput?

  2. How does the trtexec QPS (with vs. without streams) translate into real DeepStream FPS when the engine is deployed?

  3. Is there a recommended formula or rule of thumb for choosing batch size so that throughput scales without hitting diminishing returns (e.g., higher latency or inefficient GPU scheduling)?

  4. Does using multiple inference streams at the TensorRT level map effectively when DeepStream batches frames, or should one rely more on larger DeepStream batch-size instead?

Any guidance or best practices from the NVIDIA team or the community would be very helpful.

Thank you


trtexec measures only the standalone TensorRT engine performance, while a DeepStream pipeline/app is a complete application involving many different resources and modules. They are totally different things, so there is no way to calculate DeepStream FPS directly from trtexec QPS.

In general, the slowest module (the bottleneck) decides the performance (FPS) of the DeepStream pipeline/app.

There is no such formula or rule for DeepStream. You can refer to the Troubleshooting section of the DeepStream documentation.
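As a practical way to measure the pipeline's actual FPS (as opposed to trtexec QPS), deepstream-app can print per-stream throughput periodically. A minimal sketch of the relevant settings, which go in the top-level deepstream-app config (not the nvinfer config shown above):

```ini
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
```

With this enabled, the app logs per-source FPS every 5 seconds, which makes it straightforward to see which element is the bottleneck when batch-size or stream counts change.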

There has been no update from you for a period, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
