Latency increases linearly with batch size or number of concurrent requests (TensorRT)


We are running a TensorRT model with the tensorrt_plan platform on Triton Inference Server. When using perf_analyzer, we see a linear increase in latency as we increase the batch size:

batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec
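
To make the linearity explicit, the figures above can be divided by batch size. This is only a quick arithmetic sketch over the numbers quoted in this post; the helper name `per_image_latency_ms` is made up for illustration. The per-image latency stays at roughly 20.5 to 20.7 ms regardless of batch size, which suggests batching is giving essentially no speedup.

```python
# perf_analyzer results quoted above: batch size -> (infer/sec, latency in usec)
results = {1: (49.4, 20488), 2: (49.2, 41419), 3: (48.6, 62209)}


def per_image_latency_ms(batch_size, latency_usec):
    """Request latency divided evenly across the images in the batch, in ms."""
    return latency_usec / batch_size / 1000.0


for batch_size, (throughput, latency_usec) in results.items():
    per_image = per_image_latency_ms(batch_size, latency_usec)
    print(f"batch {batch_size}: {per_image:.2f} ms per image, {throughput} infer/sec")
```

If batching were helping, the per-image latency would drop as the batch size grows and infer/sec would rise.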

We also see a near-linear increase in latency when we add concurrency:

Concurrency: 1, throughput: 51.4 infer/sec, latency 20192 usec
Concurrency: 2, throughput: 60.4 infer/sec, latency 33602 usec
Concurrency: 3, throughput: 59.8 infer/sec, latency 51003 usec
Concurrency: 4, throughput: 59 infer/sec, latency 68353 usec
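
As a sanity check on these numbers: perf_analyzer keeps a fixed number of requests in flight, so throughput should be roughly concurrency divided by latency (Little's law). The sketch below applies that to the measurements above; `predicted_throughput` is a hypothetical helper, not part of any Triton tool. The predictions land within a few percent of the measured values, which would be consistent with requests queueing behind each other rather than executing in parallel.

```python
# Measurements quoted above: concurrency -> (infer/sec, latency in usec)
measurements = {
    1: (51.4, 20192),
    2: (60.4, 33602),
    3: (59.8, 51003),
    4: (59.0, 68353),
}


def predicted_throughput(concurrency, latency_usec):
    """Little's law for a closed-loop client: throughput = in-flight / latency."""
    return concurrency / (latency_usec / 1_000_000)


for concurrency, (measured, latency_usec) in measurements.items():
    predicted = predicted_throughput(concurrency, latency_usec)
    print(f"concurrency {concurrency}: measured {measured:.1f}, "
          f"predicted {predicted:.1f} infer/sec")
```

Throughput plateaus at about 60 infer/sec from concurrency 2 onward, so the extra in-flight requests only add waiting time.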

I have increased the max batch size to 8G, with no change in outcome.
The model runs batch-agnostic and 10x faster when using the torch2trt TRTModule.

Similar issues without a (non-generic) answer:

Can TensorRT models be made batch-size agnostic? What about concurrency-agnostic?


TensorRT Version:
GPU Type: T4 (same issue occurs on V100)
Nvidia Driver Version: 450.119.03
CUDA Version: 11.3
Operating System + Version: Ubuntu 18.04
Container (if container which image + tag): tritonserver:21.05-py3

Relevant Files

We are running an auto-encoder model on 360x640 images; we cannot share it here due to IP.

Steps To Reproduce

  1. Export the model to TensorRT (using onnx2trt or torch2trt, same outcome with both) with dynamic batch size.
  2. Use the Triton tensorrt_plan platform.
  3. Run perf_analyzer on the TensorRT model.

The issue persists when running TensorRT outside of Triton. When I run inference directly in Python using tensorrt and pycuda, the inference times are:
Batch size 1: 0.036710262298583984
Batch size 2: 0.06481266021728516
Batch size 3: 0.09190487861633301
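
For what it's worth, a straight-line fit through these three timings separates the fixed cost from the per-image cost. This is a sketch only: it assumes the values are wall-clock seconds (the unit is not stated above), and `fit_line` is an illustrative helper implementing ordinary least squares.

```python
# Timings quoted above: batch size -> inference time (assumed to be seconds)
timings = {
    1: 0.036710262298583984,
    2: 0.06481266021728516,
    3: 0.09190487861633301,
}


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(list(timings.items()))
print(f"per-image cost: {slope * 1000:.1f} ms, fixed overhead: {intercept * 1000:.1f} ms")
```

Under that assumption, each additional image costs about 27.6 ms on top of roughly 9.3 ms of fixed overhead, so the time is dominated by per-image work even without Triton in the loop.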

Solved the concurrency issue by properly using instance groups on a small model. Batching is still an issue, but it is related to TensorRT, not Triton. Marking as solved.