Nvinferserver (Triton server) doesn't improve inference FPS for dynamic-batching models

• Hardware Platform (Jetson / GPU): both
• DeepStream Version: 6.2

I’ve been profiling nvinferserver inference with different max_batch_size combinations.

Although NVIDIA's Triton sample configs for most NGC ETLT models use a batch size greater than 1 for the tao-converter conversion, and set the corresponding max_batch_size in the nvinferserver and pbtxt model configs, in my tests increasing the batch size yields no inference FPS improvement over a TensorRT engine with a fixed batch of 1 when Triton is serving inference for multiple parallel deployments.

In fact, we only observe the following two major downsides of increasing the batch size (the relevant config fields are sketched right after this list):

  • more time required to perform the TensorRT engine creation with tao-converter or trtexec
  • increased GPU memory footprint when loading the TensorRT engine in triton server
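
For reference, here is roughly what the two knobs varied below map to in the actual config files. This is only a sketch based on my reading of the DeepStream nvinferserver and Triton docs; surrounding sections (preprocess, postprocess, model_repo, etc.) are omitted and the model name is just illustrative.

# Sketch: DeepStream nvinferserver config (protobuf text), batch-size field only
infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 8                 # varied between 1 and 8 in the experiments below
  backend {
    triton {
      model_name: "trafficcamnet"   # illustrative model name
      version: -1
    }
  }
}

# Sketch: corresponding Triton model config.pbtxt for the same model
name: "trafficcamnet"               # illustrative model name
platform: "tensorrt_plan"
max_batch_size: 8                   # kept in sync with the nvinferserver value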

Experiment #1: pipeline with TrafficCamNet ETLT model:

// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("1".into());

// Results
1 single deployment:     120 FPS (used GPU 1083MiB)
2 parallel deployments:   62 FPS (used GPU 1368MiB)
3 parallel deployments:   41 FPS (used GPU 1684MiB)
4 parallel deployments:   31 FPS (used GPU 1986MiB)
5 parallel deployments:   25 FPS (used GPU 2256MiB)

-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// -m <maximum batch size of the TRT engine>
tao_converter_args.push("-m".into());
tao_converter_args.push("8".into());

// Results
1 single deployment:     120 FPS (used GPU 1227MiB)
2 parallel deployments:   62 FPS (used GPU 1547MiB)
3 parallel deployments:   42 FPS (used GPU 1852MiB)
4 parallel deployments:   31 FPS (used GPU 2134MiB)
5 parallel deployments:   25 FPS (used GPU 2403MiB)

Experiment #2: pipeline with a custom YoloR ONNX model:

// Configuration settings
nvinferserver_config.max_batch_size = 1;
triton_model_config.max_batch_size = 1;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:1x{input_shape}
--maxShapes=input:1x{input_shape}

1 single deployment:    26 FPS (used GPU 1498MiB)
2 parallel deployments: 13 FPS (used GPU 1768MiB)
3 parallel deployments:  9 FPS (used GPU 2104MiB)
4 parallel deployments:  6 FPS (used GPU 2366MiB)
5 parallel deployments:  5 FPS (used GPU 2664MiB)

-------------------------------------
// Configuration settings
nvinferserver_config.max_batch_size = 8;
triton_model_config.max_batch_size = 8;
// trtexec input shapes
--minShapes=input:1x{input_shape}
--optShapes=input:4x{input_shape}
--maxShapes=input:8x{input_shape}

1 single deployment:    25 FPS (used GPU 2091MiB)
2 parallel deployments: 13 FPS (used GPU 2379MiB)
3 parallel deployments:  9 FPS (used GPU 2713MiB)
4 parallel deployments:  7 FPS (used GPU 2962MiB)
5 parallel deployments:  5 FPS (used GPU 3183MiB)
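
For completeness, a Triton config.pbtxt for a batch-8 TensorRT plan like the one above would look roughly like the sketch below; this is an illustration rather than my exact config, the names are placeholders, and the input/output entries are omitted since they only need to match the dynamic-shape profile baked into the engine by trtexec.

# Sketch: Triton model config for a batch-8 YoloR plan (placeholder names)
name: "yolor"
platform: "tensorrt_plan"
max_batch_size: 8
default_model_filename: "model.plan"
# input { ... } / output { ... } entries omitted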

-------------------------------------

nvinfer (batch=1 with dedicated inference engine per deployment)

1 single deployment:    28 FPS (used GPU 1470MiB)
2 parallel deployments: 14 FPS (used GPU 2104MiB)
3 parallel deployments:  9 FPS (used GPU 2728MiB)
4 parallel deployments:  7 FPS (used GPU 3323MiB)
5 parallel deployments:  5 FPS (used GPU 3927MiB)

Apparently, Triton dynamic batching is not enabled for any of the model configuration examples that use max_batch_size > 1 in the following repository:

This makes me wonder whether the nvinferserver & Triton configs that NVIDIA currently shares on GitHub are effectively running single-image (1x) inference even when a higher max_batch_size is set.
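
For context, enabling dynamic batching in a Triton config.pbtxt is a one-block addition. The sketch below only uses fields already mentioned in this thread; the model name and the queue-delay value are illustrative, not recommendations.

# Sketch: Triton config.pbtxt with dynamic batching enabled
name: "trafficcamnet"               # illustrative model name
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching {
  # Leaving this block empty uses Triton's defaults; max_queue_delay_microseconds
  # is the knob tweaked in the PS below (the value here is an arbitrary example).
  max_queue_delay_microseconds: 100
}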

PS: In a separate test I also added dynamic_batching { } (default settings) and even tweaked max_queue_delay_microseconds, but for these models there was no improvement in the parallel inference times compared to the configs that don't set dynamic_batching at all when using DeepStream nvinferserver + Triton server.
Additionally, despite its bigger memory footprint, nvinfer with a fixed batch of 1 and a dedicated engine per deployment shows equivalent performance!

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
deepstream-test1 can support nvinferserver. Could you use the DeepStream sample to reproduce this issue? Please share the simplified code and configuration file. Thanks!
