We are running a TensorRT model with the tensorrt_plan platform on Triton Inference Server. When using perf_analyzer we see a linear increase in latency as we increase the batch size:
batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec
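A quick sanity check on the numbers above (values copied from the perf_analyzer runs) shows the per-item latency is essentially constant, i.e. the server behaves as if each item in the batch is processed sequentially rather than in one batched engine pass:

```python
# batch size -> total request latency in usec, from the perf_analyzer runs above
measurements = {1: 20488, 2: 41419, 3: 62209}

for batch, latency_us in measurements.items():
    per_item_us = latency_us / batch
    # per-item latency stays ~20.5 ms for every batch size,
    # which is what serialized (non-batched) execution looks like
    print(batch, round(per_item_us))
```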
We also see a near-linear increase in latency when we add concurrency:
Concurrency: 1, throughput: 51.4 infer/sec, latency 20192 usec
Concurrency: 2, throughput: 60.4 infer/sec, latency 33602 usec
Concurrency: 3, throughput: 59.8 infer/sec, latency 51003 usec
Concurrency: 4, throughput: 59 infer/sec, latency 68353 usec
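Applying Little's law (latency = concurrency / throughput) to these numbers suggests the server is saturated at roughly 50-60 infer/sec no matter how many requests are in flight, so additional concurrency only adds queueing delay, consistent with all requests serializing behind a single execution instance:

```python
# (concurrency, throughput in infer/sec, measured latency in usec)
# values copied from the perf_analyzer runs above
data = [
    (1, 51.4, 20192),
    (2, 60.4, 33602),
    (3, 59.8, 51003),
    (4, 59.0, 68353),
]

for conc, throughput, latency_us in data:
    # Little's law prediction for a saturated system
    predicted_us = conc / throughput * 1e6
    # predicted and measured latency agree to within a few percent
    print(conc, round(predicted_us), latency_us)
```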
I have increased the max batch size to 8G; there is no change in the outcome.
The same model runs batch-size agnostic and about 10x faster when called directly through torch2trt's TRTModule.
Similar issues have been reported without a non-generic answer.
Can TensorRT models be made batch-size agnostic? And concurrency-agnostic?
TensorRT Version: 7.2.3.4 (shipped in the tritonserver:21.05-py3 container)
GPU Type: T4 (same issue occurs on V100)
Nvidia Driver Version: 450.119.03
CUDA Version: 11.3
Operating System + Version: Ubuntu 18.04
Container (if container which image + tag): tritonserver:21.05-py3
We are running an auto-encoder model on 360x640 images; we cannot share it here due to IP restrictions. Steps to reproduce:
- Export the model to TensorRT (using onnx2trt or torch2trt; same outcome with both) with a dynamic batch size
- Serve it with the Triton tensorrt_plan platform
- Run perf_analyzer against the TensorRT model
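For reference, a minimal config.pbtxt along the lines of what we use (model name, tensor names, and dims are placeholders). dynamic_batching and instance_group are the two standard knobs for batching queued requests server-side and running multiple engine copies per GPU:

```
name: "autoencoder_trt"        # placeholder name
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"              # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 360, 640 ]
  }
]
output [
  {
    name: "output"             # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 360, 640 ]
  }
]
# batch queued requests together server-side
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
# run more than one copy of the engine per GPU
instance_group [ { count: 2, kind: KIND_GPU } ]
```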