Latency increases linearly with increased batch size or concurrent models

Description

We are running a TRT model with the tensorrt_plan platform on Triton Inference Server. When using perf_analyzer we see a linear increase in latency as we increase the batch size:

batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec

We also see a near-linear increase in latency when we add concurrency:

Concurrency: 1, throughput: 51.4 infer/sec, latency 20192 usec
Concurrency: 2, throughput: 60.4 infer/sec, latency 33602 usec
Concurrency: 3, throughput: 59.8 infer/sec, latency 51003 usec
Concurrency: 4, throughput: 59 infer/sec, latency 68353 usec
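
For reference, these numbers come from perf_analyzer invocations roughly of the form perf_analyzer -m <model_name> -b <batch_size> for the batch-size sweep and perf_analyzer -m <model_name> --concurrency-range 1:4 for the concurrency sweep (the flags here are approximate, not the exact commands we ran).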

I have increased the max batch size to 8G; there is no change in the outcome.
The model runs batch-agnostic and 10x faster when using the torch2trt TRTModule.

Similar issues without a (non-generic) answer:

Can TensorRT models be made batch-size agnostic? What about concurrency-agnostic?

Environment

TensorRT Version: 7.2.3.4
GPU Type: T4 (same issue occurs on V100)
Nvidia Driver Version: 450.119.03
CUDA Version: 11.3
Operating System + Version: Ubuntu 18.04
Container (if container which image + tag): tritonserver:21.05-py3

Relevant Files

We are running an auto-encoder model on 360x640 images; we cannot share it here due to IP.

Steps To Reproduce

  1. Export the model to TensorRT (using onnx2trt or torch2trt; the outcome is the same with both), with a dynamic batch size (a sketch of the dynamic-batch export is shown after this list)
  2. Use the Triton tensorrt_plan platform
  3. Run perf_analyzer on the TRT model
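
For step 1, here is a minimal sketch of the dynamic-batch ONNX export, using the simple bilinear-upsample stand-in model discussed later in this thread rather than the actual (IP-protected) auto-encoder. The layout is NCHW for PyTorch convenience, whereas the bilinear.onnx attached later in the thread uses NHWC; treat names and shapes as illustrative:

```
# Sketch only: export a stand-in bilinear-upsample model to ONNX with a
# dynamic batch axis; the engine is then built from the ONNX file
# (e.g. with trtexec or onnx2trt).
import torch
import torch.nn as nn


class Bilinear(nn.Module):
    def forward(self, x):  # x: N x C x H x W
        return nn.functional.interpolate(
            x, scale_factor=2, mode="bilinear", align_corners=False)


model = Bilinear().eval()
dummy = torch.randn(1, 3, 180, 320)
torch.onnx.export(
    model, dummy, "bilinear.onnx",
    input_names=["IMAGE"], output_names=["OUTPUT"],
    dynamic_axes={"IMAGE": {0: "batch"}, "OUTPUT": {0: "batch"}},
    opset_version=11)
```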

Hi @megan1,

We recommend that you post your concern on the Triton-related forum to get better help.

Thank you.

reposted: Latency linearly increases when increased batch size or concurrent models Tensorrt

FYI, the issue persists when running TensorRT outside of Triton. When I run TensorRT inference in Python directly using tensorrt and pycuda, the inference times (in seconds) are:
Batch size 1: 0.036710262298583984
Batch size 2: 0.06481266021728516
Batch size 3: 0.09190487861633301

This seems to be a TensorRT issue, not a Triton issue.
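
For reference, the direct TensorRT timing above comes from a loop along these lines (a minimal sketch, not the exact script; it assumes a single dynamic-batch input and output, and the shapes are illustrative):

```
# Sketch of timing raw TensorRT inference with pycuda (not the exact script).
# Assumes "model.engine" has one input (binding 0) and one output (binding 1)
# with a dynamic batch dimension; shapes below are illustrative.
import time
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

BATCH = 3
logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
context.set_binding_shape(0, (BATCH, 180, 320, 3))  # set the dynamic batch dim

h_in = np.random.rand(BATCH, 180, 320, 3).astype(np.float32)
h_out = np.empty((BATCH, 360, 640, 3), dtype=np.float32)
d_in, d_out = cuda.mem_alloc(h_in.nbytes), cuda.mem_alloc(h_out.nbytes)
stream = cuda.Stream()


def run_once():
    cuda.memcpy_htod_async(d_in, h_in, stream)
    context.execute_async_v2([int(d_in), int(d_out)], stream.handle)
    cuda.memcpy_dtoh_async(h_out, d_out, stream)
    stream.synchronize()


run_once()  # warm-up
start = time.time()
run_once()
print(f"Batch size {BATCH}: {time.time() - start}")
```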

Hi @megan1 ,

Thank you for confirming that you can reproduce the issue outside of Triton.

We recommend that you try the latest TensorRT version. While measuring model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy

You can also run nvidia-smi dmon -s u to check GPU utilization for different batch sizes and confirm whether the GPU is already 100% utilized at batch size 1. Or use Nsight Systems to visualize the profiles: NVIDIA Nsight Systems | NVIDIA Developer

If you still face this issue, please share a minimal repro script/model and the steps to try from our end for better help.

Thank you.

The problem persists with TRT 8.0.16.

There is no model pre- or post-processing.

I have included both throughput and latency in my logs above.

I can run a batch size of 64 when using torch2trt, with batch-agnostic latency. I run out of GPU memory at batch size 128.

You should be able to reproduce this with any auto-encoder network. I have reproduced this same issue with every network I have tried, including a simple bilinear interpolation model. As I mentioned, I cannot provide my models due to IP.

Here are some example files, created with TRT 7.2.3.4:

This is just a simple bilinear upsample model, but you can easily see the linear increase in latency as the batch size increases.

model.engine was created by: python3 write_trt.py

Run inference on the model with: python3 inference_trt.py -b <batch_size>

inference_trt.py (2.9 KB)
model.engine (2.0 KB)
write_trt.py (1.0 KB)
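
For context, a dynamic-batch engine of this kind can be built from an ONNX model with the TensorRT Python API roughly as follows. This is a sketch under assumed I/O names and shape ranges (taken from the bilinear model shared further down), not necessarily how the attached write_trt.py builds model.engine:

```
# Sketch: build a dynamic-batch TensorRT engine from an ONNX file using an
# explicit-batch network and an optimization profile (min/opt/max batch).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("bilinear.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB
profile = builder.create_optimization_profile()
profile.set_shape("IMAGE",
                  (1, 180, 320, 3),   # min
                  (8, 180, 320, 3),   # opt
                  (64, 180, 320, 3))  # max
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
with open("model.engine", "wb") as f:
    f.write(engine.serialize())
```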

Thank you for sharing the model. Please allow us some time to get back to you on this.

Hi,

Sorry, we are facing some issues when we try to build using torch2trt. Could you please share the ONNX model for the above simple model?

torch2trt is not compatible with TRT version 8.*

I’ve attached the onnx file
input name: ‘IMAGE’ size Bx180x320x3
output name: ‘OUTPUT’ size Bx360x640x3

bilinear.onnx (888 Bytes)
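
A quick way to confirm those I/O names and the dynamic batch dimension in the attached file (a small sketch, assuming the onnx Python package is installed):

```
# Print the input/output names and shapes of bilinear.onnx; a dynamic batch
# dimension shows up as a symbolic name instead of an integer.
import onnx

model = onnx.load("bilinear.onnx")
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_param or d.dim_value
            for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)
```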

Here are example outputs from trtexec, run like: ./trtexec --onnx=bilinear.onnx --minShapes='IMAGE':1x180x320x3,'OUTPUT':1x360x640x3 --optShapes='IMAGE':Bx180x320x3,'OUTPUT':Bx360x640x3 --maxShapes='IMAGE':64x180x320x3,'OUTPUT':64x360x640x3

You’ll see a linear increase in latency.

B=1:

[09/21/2021-20:10:07] [I] Average on 10 runs - GPU latency: 0.0435547 ms - Host latency: 0.611108 ms (end to end 0.81272 ms, enqueue 0.0090332 ms)
[09/21/2021-20:10:07] [I] Host Latency
[09/21/2021-20:10:07] [I] min: 0.593262 ms (end to end 0.628418 ms)
[09/21/2021-20:10:07] [I] max: 0.616211 ms (end to end 0.867676 ms)
[09/21/2021-20:10:07] [I] mean: 0.611237 ms (end to end 0.814108 ms)
[09/21/2021-20:10:07] [I] median: 0.611023 ms (end to end 0.814697 ms)
[09/21/2021-20:10:07] [I] percentile: 0.614563 ms at 99% (end to end 0.816162 ms at 99%)
[09/21/2021-20:10:07] [I] throughput: 0 qps
[09/21/2021-20:10:07] [I] walltime: 3.00143 s
[09/21/2021-20:10:07] [I] Enqueue Time
[09/21/2021-20:10:07] [I] min: 0.00805664 ms
[09/21/2021-20:10:07] [I] max: 0.0180664 ms
[09/21/2021-20:10:07] [I] median: 0.0090332 ms
[09/21/2021-20:10:07] [I] GPU Compute
[09/21/2021-20:10:07] [I] min: 0.0415039 ms
[09/21/2021-20:10:07] [I] max: 0.0471191 ms
[09/21/2021-20:10:07] [I] mean: 0.0437881 ms
[09/21/2021-20:10:07] [I] median: 0.043457 ms
[09/21/2021-20:10:07] [I] percentile: 0.046875 ms at 99%
[09/21/2021-20:10:07] [I] total compute time: 0.297453 s

B=2

[09/21/2021-20:16:37] [I] Average on 10 runs - GPU latency: 0.0808105 ms - Host latency: 1.20684 ms (end to end 1.68406 ms, enqueue 0.00878906 ms)
[09/21/2021-20:16:37] [I] Host Latency
[09/21/2021-20:16:37] [I] min: 1.17065 ms (end to end 1.63037 ms)
[09/21/2021-20:16:37] [I] max: 1.21362 ms (end to end 1.69461 ms)
[09/21/2021-20:16:37] [I] mean: 1.20674 ms (end to end 1.6848 ms)
[09/21/2021-20:16:37] [I] median: 1.20654 ms (end to end 1.6853 ms)
[09/21/2021-20:16:37] [I] percentile: 1.20856 ms at 99% (end to end 1.68738 ms at 99%)
[09/21/2021-20:16:37] [I] throughput: 0 qps
[09/21/2021-20:16:37] [I] walltime: 3.00238 s
[09/21/2021-20:16:37] [I] Enqueue Time
[09/21/2021-20:16:37] [I] min: 0.00756836 ms
[09/21/2021-20:16:37] [I] max: 0.0177612 ms
[09/21/2021-20:16:37] [I] median: 0.00878906 ms
[09/21/2021-20:16:37] [I] GPU Compute
[09/21/2021-20:16:37] [I] min: 0.0787354 ms
[09/21/2021-20:16:37] [I] max: 0.0838013 ms
[09/21/2021-20:16:37] [I] mean: 0.0807835 ms
[09/21/2021-20:16:37] [I] median: 0.0804443 ms
[09/21/2021-20:16:37] [I] percentile: 0.0820923 ms at 99%
[09/21/2021-20:16:37] [I] total compute time: 0.276199 s

B=3

[09/21/2021-20:17:27] [I] Average on 10 runs - GPU latency: 0.117114 ms - Host latency: 1.80073 ms (end to end 2.55527 ms, enqueue 0.00908203 ms)
[09/21/2021-20:17:27] [I] Host Latency
[09/21/2021-20:17:27] [I] min: 1.74707 ms (end to end 2.5 ms)
[09/21/2021-20:17:27] [I] max: 1.80957 ms (end to end 2.56079 ms)
[09/21/2021-20:17:27] [I] mean: 1.80084 ms (end to end 2.55501 ms)
[09/21/2021-20:17:27] [I] median: 1.80066 ms (end to end 2.55554 ms)
[09/21/2021-20:17:27] [I] percentile: 1.80298 ms at 99% (end to end 2.55713 ms at 99%)
[09/21/2021-20:17:27] [I] throughput: 0 qps
[09/21/2021-20:17:27] [I] walltime: 3.00303 s
[09/21/2021-20:17:27] [I] Enqueue Time
[09/21/2021-20:17:27] [I] min: 0.00805664 ms
[09/21/2021-20:17:27] [I] max: 0.0180664 ms
[09/21/2021-20:17:27] [I] median: 0.00915527 ms
[09/21/2021-20:17:27] [I] GPU Compute
[09/21/2021-20:17:27] [I] min: 0.114868 ms
[09/21/2021-20:17:27] [I] max: 0.120102 ms
[09/21/2021-20:17:27] [I] mean: 0.117101 ms
[09/21/2021-20:17:27] [I] median: 0.116943 ms
[09/21/2021-20:17:27] [I] percentile: 0.118896 ms at 99%
[09/21/2021-20:17:27] [I] total compute time: 0.267577 s

B=4

[09/21/2021-20:17:57] [I] Average on 10 runs - GPU latency: 0.159546 ms - Host latency: 2.4011 ms (end to end 3.42803 ms, enqueue 0.0090332 ms)
[09/21/2021-20:17:57] [I] Host Latency
[09/21/2021-20:17:57] [I] min: 2.33008 ms (end to end 3.35645 ms)
[09/21/2021-20:17:57] [I] max: 2.40906 ms (end to end 3.43738 ms)
[09/21/2021-20:17:57] [I] mean: 2.40099 ms (end to end 3.42803 ms)
[09/21/2021-20:17:57] [I] median: 2.40112 ms (end to end 3.42847 ms)
[09/21/2021-20:17:57] [I] percentile: 2.40332 ms at 99% (end to end 3.43005 ms at 99%)
[09/21/2021-20:17:57] [I] throughput: 0 qps
[09/21/2021-20:17:57] [I] walltime: 3.0056 s
[09/21/2021-20:17:57] [I] Enqueue Time
[09/21/2021-20:17:57] [I] min: 0.00805664 ms
[09/21/2021-20:17:57] [I] max: 0.0178833 ms
[09/21/2021-20:17:57] [I] median: 0.00915527 ms
[09/21/2021-20:17:57] [I] GPU Compute
[09/21/2021-20:17:57] [I] min: 0.157593 ms
[09/21/2021-20:17:57] [I] max: 0.162079 ms
[09/21/2021-20:17:57] [I] mean: 0.159489 ms
[09/21/2021-20:17:57] [I] median: 0.159668 ms
[09/21/2021-20:17:57] [I] percentile: 0.161682 ms at 99%
[09/21/2021-20:17:57] [I] total compute time: 0.273842 s

B=8

[09/21/2021-20:18:30] [I] Average on 10 runs - GPU latency: 0.312866 ms - Host latency: 4.78594 ms (end to end 6.91038 ms, enqueue 0.00964355 ms)
[09/21/2021-20:18:30] [I] Host Latency
[09/21/2021-20:18:30] [I] min: 4.64575 ms (end to end 6.76855 ms)
[09/21/2021-20:18:30] [I] max: 4.79071 ms (end to end 6.91711 ms)
[09/21/2021-20:18:30] [I] mean: 4.78564 ms (end to end 6.91054 ms)
[09/21/2021-20:18:30] [I] median: 4.78589 ms (end to end 6.91125 ms)
[09/21/2021-20:18:30] [I] percentile: 4.78845 ms at 99% (end to end 6.91321 ms at 99%)
[09/21/2021-20:18:30] [I] throughput: 0 qps
[09/21/2021-20:18:30] [I] walltime: 3.01061 s
[09/21/2021-20:18:30] [I] Enqueue Time
[09/21/2021-20:18:30] [I] min: 0.00842285 ms
[09/21/2021-20:18:30] [I] max: 0.0207214 ms
[09/21/2021-20:18:30] [I] median: 0.00952148 ms
[09/21/2021-20:18:30] [I] GPU Compute
[09/21/2021-20:18:30] [I] min: 0.309448 ms
[09/21/2021-20:18:30] [I] max: 0.317429 ms
[09/21/2021-20:18:30] [I] mean: 0.312851 ms
[09/21/2021-20:18:30] [I] median: 0.31308 ms
[09/21/2021-20:18:30] [I] percentile: 0.31543 ms at 99%
[09/21/2021-20:18:30] [I] total compute time: 0.269365 s

B=64

[09/21/2021-20:19:19] [I] Average on 10 runs - GPU latency: 2.46104 ms - Host latency: 38.059 ms (end to end 55.6042 ms, enqueue 0.0102051 ms)
[09/21/2021-20:19:19] [I] Host Latency
[09/21/2021-20:19:19] [I] min: 37.0532 ms (end to end 54.5974 ms)
[09/21/2021-20:19:19] [I] max: 38.1917 ms (end to end 55.7484 ms)
[09/21/2021-20:19:19] [I] mean: 38.1627 ms (end to end 55.7096 ms)
[09/21/2021-20:19:19] [I] median: 38.1709 ms (end to end 55.7176 ms)
[09/21/2021-20:19:19] [I] percentile: 38.1875 ms at 99% (end to end 55.7404 ms at 99%)
[09/21/2021-20:19:19] [I] throughput: 0 qps
[09/21/2021-20:19:19] [I] walltime: 3.09588 s
[09/21/2021-20:19:19] [I] Enqueue Time
[09/21/2021-20:19:19] [I] min: 0.00939941 ms
[09/21/2021-20:19:19] [I] max: 0.0240479 ms
[09/21/2021-20:19:19] [I] median: 0.0102539 ms
[09/21/2021-20:19:19] [I] GPU Compute
[09/21/2021-20:19:19] [I] min: 2.45557 ms
[09/21/2021-20:19:19] [I] max: 2.46692 ms
[09/21/2021-20:19:19] [I] mean: 2.46089 ms
[09/21/2021-20:19:19] [I] median: 2.46118 ms
[09/21/2021-20:19:19] [I] percentile: 2.46533 ms at 99%
[09/21/2021-20:19:19] [I] total compute time: 0.270698 s

Hi @megan1,

We could reproduce the issue. Please allow us some time to work on this.

Thank you.

Hi @megan1,

Our team is looking into this issue. Could you please elaborate on the statements below?

The model runs batch-agnostic and 10x faster when using the torch2trt TRTModule.
Can TensorRT models be made batch-size agnostic? What about concurrency-agnostic?

What does “batch-size agnostic” mean in this context? Does it mean building an engine with minShapes==optShapes==maxShapes? Or something else?

Thank you.

Batch-agnostic means the same latency as batch size increases: as long as there is GPU compute available, the model should run a forward pass in the same amount of time regardless of batch size. For example, this is how PyTorch works and how TRT models behave in torch2trt.

Thank you.

I am speaking with Yuki Ni in nvbugs: https://developer.nvidia.com/nvidia_bug/3384703 and https://developer.nvidia.com/nvidia_bug/3389191

He has asked for torch2trt reproduction code and a model. I’m attaching a zip which contains “write_bilinear.py”, which writes the following for the bilinear model I shared above:

  • trt .engine file
  • torch2trt “TRTModule” .pth file
  • .onnx file

It also contains all those models (.engine, .pth, .onnx) created on my machine: torch==1.9.0, nvidia-tensorrt==7.2.3.4, torch2trt==0.3.0

It will also profile the TRTModule inference latency, averaged over 1000 runs. This is the program output on my T4:

batch size 1: 0.057 ms
batch size 2: 0.057 ms
batch size 3: 0.057 ms
batch size 4: 0.057 ms
batch size 8: 0.057 ms
batch size 64: 0.058 ms

bilinear_files.zip (3.5 KB)
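
For reference, the torch2trt conversion and timing in write_bilinear.py follow this general pattern (a condensed sketch; the attached script is the authoritative version, and the model here is the same bilinear stand-in):

```
# Sketch: convert the stand-in bilinear model with torch2trt and time the
# TRTModule forward pass, averaging over 1000 runs as in write_bilinear.py.
import time
import torch
import torch.nn as nn
from torch2trt import torch2trt

RUNS = 1000


class Bilinear(nn.Module):
    def forward(self, x):
        return nn.functional.interpolate(
            x, scale_factor=2, mode="bilinear", align_corners=False)


model = Bilinear().eval().cuda()
x = torch.randn(1, 3, 180, 320).cuda()
model_trt = torch2trt(model, [x], max_batch_size=64)
torch.save(model_trt.state_dict(), "bilinear_trt.pth")

for batch in (1, 2, 3, 4, 8, 64):
    data = torch.randn(batch, 3, 180, 320).cuda()
    with torch.no_grad():
        model_trt(data)              # warm-up
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(RUNS):
            model_trt(data)
        torch.cuda.synchronize()
    avg_ms = (time.time() - start) / RUNS * 1000
    print(f"batch size {batch}: {avg_ms:.3f} ms")
```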