• Hardware Platform (Jetson / GPU): both
• DeepStream Version: 6.2
I’ve been profiling nvinferserver inference with different max_batch_size combinations.
Although NVIDIA's Triton sample configs for most NGC ETLT models use a batch size greater than 1 for the tao-converter conversion, with a matching max_batch_size in the nvinferserver and pbtxt model configs, in my tests increasing the batch size gives no inference FPS improvement over a TensorRT engine with a fixed batch of 1 when Triton is serving inference for multiple parallel deployments.
In fact, we observe only two major downsides of increasing the batch size:
- more time required to build the TensorRT engine with tao-converter or trtexec
- increased GPU memory footprint when loading the TensorRT engine in Triton Server
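For context, these are sketches of the two engine-build paths being compared (illustrative only: the key, precision, and file names are placeholders; the trtexec shapes keep the same {input_shape} placeholder used below):

```shell
# TAO/ETLT models: -m bakes the maximum batch size into the TRT engine.
tao-converter -k <ngc_key> -m 8 -t fp16 \
    -e trafficcamnet_b8.engine trafficcamnet.etlt

# ONNX models: dynamic batch range set via trtexec optimization profiles.
trtexec --onnx=yolor.onnx \
    --minShapes=input:1x{input_shape} \
    --optShapes=input:4x{input_shape} \
    --maxShapes=input:8x{input_shape} \
    --saveEngine=yolor_b8.engine
```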
Experiment #1: pipeline with TrafficCamNet ETLT model:
// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("1".into());
// Results
1 single deployment: 120 FPS (used GPU 1083MiB)
2 parallel deployments: 62 FPS (used GPU 1368MiB)
3 parallel deployments: 41 FPS (used GPU 1684MiB)
4 parallel deployments: 31 FPS (used GPU 1986MiB)
5 parallel deployments: 25 FPS (used GPU 2256MiB)
-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("8".into());
// Results
1 single deployment: 120 FPS (used GPU 1227MiB)
2 parallel deployments: 62 FPS (used GPU 1547MiB)
3 parallel deployments: 42 FPS (used GPU 1852MiB)
4 parallel deployments: 31 FPS (used GPU 2134MiB)
5 parallel deployments: 25 FPS (used GPU 2403MiB)
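One way to read the batch-1 numbers above: the aggregate throughput across deployments stays roughly constant, which suggests the GPU is already saturated at batch 1 and the extra batch capacity in the engine simply goes unused. A quick check against the measured Experiment #1 figures:

```python
# Batch-1 per-deployment FPS measured in Experiment #1 above.
per_deployment_fps = {1: 120, 2: 62, 3: 41, 4: 31, 5: 25}

# Aggregate throughput = deployments x FPS per deployment.
aggregates = [n * fps for n, fps in per_deployment_fps.items()]
for (n, fps), total in zip(per_deployment_fps.items(), aggregates):
    print(f"{n} deployment(s): {fps} FPS each -> {total} FPS aggregate")

# Aggregate stays within a ~5 FPS band around 120-125 FPS, i.e. the GPU
# is the shared bottleneck regardless of max_batch_size.
assert max(aggregates) - min(aggregates) <= 10
```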
Experiment #2: pipeline with a custom YoloR ONNX model:
// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:1x{input_shape}
--maxShapes=input:1x{input_shape}
// Results
1 single deployment: 26 FPS (used GPU 1498MiB)
2 parallel deployments: 13 FPS (used GPU 1768MiB)
3 parallel deployments: 9 FPS (used GPU 2104MiB)
4 parallel deployments: 6 FPS (used GPU 2366MiB)
5 parallel deployments: 5 FPS (used GPU 2664MiB)
-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:4x{input_shape}
--maxShapes=input:8x{input_shape}
// Results
1 single deployment: 25 FPS (used GPU 2091MiB)
2 parallel deployments: 13 FPS (used GPU 2379MiB)
3 parallel deployments: 9 FPS (used GPU 2713MiB)
4 parallel deployments: 7 FPS (used GPU 2962MiB)
5 parallel deployments: 5 FPS (used GPU 3183MiB)
-------------------------------------
nvinfer (batch=1 with dedicated inference engine per deployment)
1 single deployment: 28 FPS (used GPU 1470MiB)
2 parallel deployments: 14 FPS (used GPU 2104MiB)
3 parallel deployments: 9 FPS (used GPU 2728MiB)
4 parallel deployments: 7 FPS (used GPU 3323MiB)
5 parallel deployments: 5 FPS (used GPU 3927MiB)
Apparently, Triton dynamic batching is not enabled in any of the model configuration examples that use max_batch_size > 1 in the following repository:
This makes me wonder whether the nvinferserver and Triton configs that NVIDIA currently shares on GitHub effectively run single-image (1x) batch inference even when a higher max_batch_size is set.
PS: In a separate test I also added dynamic_batching { } (default settings) and even tweaked max_queue_delay_microseconds, but for these models there was no improvement in parallel inference times compared to configs that don't set dynamic_batching when using DeepStream nvinferserver + Triton Server.
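For reference, this is the kind of config.pbtxt fragment that test used; it is a sketch only, and the model name, preferred batch sizes, and queue delay are placeholder values, not the exact ones tested:

```protobuf
name: "trafficcamnet"
platform: "tensorrt_plan"
max_batch_size: 8
# dynamic_batching { } alone uses Triton's defaults; the fields below
# are the knobs that were tweaked without effect in this test.
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
```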
Additionally, nvinfer with a single batch (1x) and a dedicated engine per deployment shows equivalent performance, despite its bigger memory footprint!