• Hardware Platform (Jetson / GPU): both
• DeepStream Version: 6.2
I’ve been profiling nvinferserver inference with different max_batch_size combinations.
Although NVIDIA's Triton sample configs for most NGC ETLT models use a batch size greater than 1 for the tao-converter conversion, with a matching max_batch_size in the nvinferserver and pbtxt model configs, in my tests increasing the batch size gives no inference FPS improvement over a TensorRT engine with a fixed batch of 1 when Triton is serving inference for multiple parallel deployments.
In fact, we observe only two major downsides of increasing the batch size:
- more time required to build the TensorRT engine with tao-converter or trtexec
- increased GPU memory footprint when loading the TensorRT engine in Triton Server
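For context, these are sketches of the two engine-build paths being compared (illustrative only: the key, precision, and file names are placeholders; the trtexec shapes keep the same {input_shape} placeholder used below):

```shell
# TAO/ETLT models: -m bakes the maximum batch size into the TRT engine.
tao-converter -k <ngc_key> -m 8 -t fp16 \
    -e trafficcamnet_b8.engine trafficcamnet.etlt

# ONNX models: dynamic batch range set via trtexec optimization profiles.
trtexec --onnx=yolor.onnx \
    --minShapes=input:1x{input_shape} \
    --optShapes=input:4x{input_shape} \
    --maxShapes=input:8x{input_shape} \
    --saveEngine=yolor_b8.engine
```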
Experiment #1: pipeline with TrafficCamNet ETLT model:
// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("1".into());
// Results
1 single deployment: 120 FPS (used GPU 1083MiB)
2 parallel deployments: 62 FPS (used GPU 1368MiB)
3 parallel deployments: 41 FPS (used GPU 1684MiB)
4 parallel deployments: 31 FPS (used GPU 1986MiB)
5 parallel deployments: 25 FPS (used GPU 2256MiB)
-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("8".into());
// Results
1 single deployment: 120 FPS (used GPU 1227MiB)
2 parallel deployments: 62 FPS (used GPU 1547MiB)
3 parallel deployments: 42 FPS (used GPU 1852MiB)
4 parallel deployments: 31 FPS (used GPU 2134MiB)
5 parallel deployments: 25 FPS (used GPU 2403MiB)
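One way to read the batch-1 numbers above: the aggregate throughput across deployments stays roughly constant, which suggests the GPU is already saturated at batch 1 and the extra batch capacity in the engine simply goes unused. A quick check against the measured Experiment #1 figures:

```python
# Batch-1 per-deployment FPS measured in Experiment #1 above.
per_deployment_fps = {1: 120, 2: 62, 3: 41, 4: 31, 5: 25}

# Aggregate throughput = deployments x FPS per deployment.
aggregates = [n * fps for n, fps in per_deployment_fps.items()]
for (n, fps), total in zip(per_deployment_fps.items(), aggregates):
    print(f"{n} deployment(s): {fps} FPS each -> {total} FPS aggregate")

# Aggregate stays within a ~5 FPS band around 120-125 FPS, i.e. the GPU
# is the shared bottleneck regardless of max_batch_size.
assert max(aggregates) - min(aggregates) <= 10
```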
Experiment #2: pipeline with a custom YoloR ONNX model:
// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:1x{input_shape}
--maxShapes=input:1x{input_shape}
// Results
1 single deployment: 26 FPS (used GPU 1498MiB)
2 parallel deployments: 13 FPS (used GPU 1768MiB)
3 parallel deployments: 9 FPS (used GPU 2104MiB)
4 parallel deployments: 6 FPS (used GPU 2366MiB)
5 parallel deployments: 5 FPS (used GPU 2664MiB)
-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:4x{input_shape}
--maxShapes=input:8x{input_shape}
// Results
1 single deployment: 25 FPS (used GPU 2091MiB)
2 parallel deployments: 13 FPS (used GPU 2379MiB)
3 parallel deployments: 9 FPS (used GPU 2713MiB)
4 parallel deployments: 7 FPS (used GPU 2962MiB)
5 parallel deployments: 5 FPS (used GPU 3183MiB)
-------------------------------------
nvinfer (batch=1 with dedicated inference engine per deployment)
1 single deployment: 28 FPS (used GPU 1470MiB)
2 parallel deployments: 14 FPS (used GPU 2104MiB)
3 parallel deployments: 9 FPS (used GPU 2728MiB)
4 parallel deployments: 7 FPS (used GPU 3323MiB)
5 parallel deployments: 5 FPS (used GPU 3927MiB)
Apparently, Triton dynamic batching is not enabled in any of the model configuration examples that use max_batch_size > 1 in the following repository:
This makes me wonder whether the nvinferserver and Triton configs that NVIDIA currently shares on GitHub effectively run single-image (1x) batch inference even when a higher max_batch_size is set.
PS: In a separate test I also added dynamic_batching { } (default settings) and even tweaked max_queue_delay_microseconds, but for these models there was no improvement in parallel inference times compared to configs that don't set dynamic_batching when using DeepStream nvinferserver + Triton Server.
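For reference, this is the kind of config.pbtxt fragment that test used; it is a sketch only, and the model name, preferred batch sizes, and queue delay are placeholder values, not the exact ones tested:

```protobuf
name: "trafficcamnet"
platform: "tensorrt_plan"
max_batch_size: 8
# dynamic_batching { } alone uses Triton's defaults; the fields below
# are the knobs that were tweaked without effect in this test.
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
```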
Additionally, nvinfer with a single batch (1x) and a dedicated engine per deployment shows equivalent performance, despite its bigger memory footprint!