We are running a TensorRT model with the tensorrt_plan platform on Triton Inference Server. When using perf_analyzer we see a linear increase in latency as we increase the batch size:
batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec
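A quick sanity check on the numbers above (values copied from the perf_analyzer runs) shows the per-item latency is essentially constant, i.e. the server behaves as if each item in the batch is processed sequentially rather than in one batched engine pass:

```python
# batch size -> total request latency in usec, from the perf_analyzer runs above
measurements = {1: 20488, 2: 41419, 3: 62209}

for batch, latency_us in measurements.items():
    per_item_us = latency_us / batch
    # per-item latency stays ~20.5 ms for every batch size,
    # which is what serialized (non-batched) execution looks like
    print(batch, round(per_item_us))
```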
We also see a near-linear increase in latency when we add concurrency:
Concurrency: 1, throughput: 51.4 infer/sec, latency 20192 usec
Concurrency: 2, throughput: 60.4 infer/sec, latency 33602 usec
Concurrency: 3, throughput: 59.8 infer/sec, latency 51003 usec
Concurrency: 4, throughput: 59 infer/sec, latency 68353 usec
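Applying Little's law (latency = concurrency / throughput) to these numbers suggests the server is saturated at roughly 50-60 infer/sec no matter how many requests are in flight, so additional concurrency only adds queueing delay, consistent with all requests serializing behind a single execution instance:

```python
# (concurrency, throughput in infer/sec, measured latency in usec)
# values copied from the perf_analyzer runs above
data = [
    (1, 51.4, 20192),
    (2, 60.4, 33602),
    (3, 59.8, 51003),
    (4, 59.0, 68353),
]

for conc, throughput, latency_us in data:
    # Little's law prediction for a saturated system
    predicted_us = conc / throughput * 1e6
    # predicted and measured latency agree to within a few percent
    print(conc, round(predicted_us), latency_us)
```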
I have increased the max batch size to 8G; there is no change in the outcome.
The same model runs batch-size agnostic and about 10x faster when called directly through torch2trt's TRTModule.
Similar issues have been reported without a non-generic answer.
Can TensorRT models be made batch-size agnostic? And concurrency-agnostic?
TensorRT Version: 7.2.3.4 (shipped in the tritonserver:21.05-py3 container)
GPU Type: T4 (same issue occurs on V100)
Nvidia Driver Version: 450.119.03
CUDA Version: 11.3
Operating System + Version: Ubuntu 18.04
Container (if container which image + tag): tritonserver:21.05-py3
We are running an auto-encoder model on 360x640 images; we cannot share it here due to IP restrictions. Steps to reproduce:
- Export the model to TensorRT (using onnx2trt or torch2trt; same outcome with both) with a dynamic batch size
- Serve it with the Triton tensorrt_plan platform
- Run perf_analyzer against the TensorRT model
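For reference, a minimal config.pbtxt along the lines of what we use (model name, tensor names, and dims are placeholders). dynamic_batching and instance_group are the two standard knobs for batching queued requests server-side and running multiple engine copies per GPU:

```
name: "autoencoder_trt"        # placeholder name
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"              # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 360, 640 ]
  }
]
output [
  {
    name: "output"             # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 360, 640 ]
  }
]
# batch queued requests together server-side
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
# run more than one copy of the engine per GPU
instance_group [ { count: 2, kind: KIND_GPU } ]
```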