We are running a TRT model with tensorrt_platform on Triton Inference Server. When using perf_analyzer, we see a linear increase in latency as we increase batch size:
FYI, the issue persists when running TensorRT outside of Triton. When I run TensorRT inference in Python directly using tensorrt and pycuda, the inference times (in seconds) are:
Batch size 1: 0.036710262298583984
Batch size 2: 0.06481266021728516
Batch size 3: 0.09190487861633301
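This is roughly the timing code I use (a condensed sketch; the engine path, binding order, and shapes are placeholders, not my actual model):

```python
import time
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch = 2
context.set_binding_shape(0, (batch, 180, 320, 3))  # dynamic-shape engine

h_in = np.random.rand(batch, 180, 320, 3).astype(np.float32)
h_out = np.empty((batch, 360, 640, 3), dtype=np.float32)
d_in, d_out = cuda.mem_alloc(h_in.nbytes), cuda.mem_alloc(h_out.nbytes)
stream = cuda.Stream()

start = time.time()
cuda.memcpy_htod_async(d_in, h_in, stream)
# assumes binding 0 is the input and binding 1 the output
context.execute_async_v2(bindings=[int(d_in), int(d_out)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(h_out, d_out, stream)
stream.synchronize()
print(time.time() - start)
```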
You can also run nvidia-smi dmon -s u to check GPU utilization for different batch sizes and confirm whether the GPU is already 100% utilized at batch size 1. Or use Nsight Systems to visualize the profiles: NVIDIA Nsight Systems | NVIDIA Developer
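If it is easier to log utilization programmatically while the benchmark runs, a small polling loop with pynvml (assuming the pynvml package is installed) also works:

```python
import time
import pynvml

# Poll GPU 0 utilization once per second for 30 seconds.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(30):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"gpu {util.gpu}%  mem {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```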
If you still face this issue, please share a minimal repro script/model and the steps to run it so that we can try it on our end.
I have included both throughput and latency in my logs above.
I can run a batch size of 64 when using torch2trt, with batch-agnostic latency. I run out of GPU memory at batch size 128.
You should be able to reproduce this with any auto-encoder network. I have reproduced the same issue with every network I have tried, including a simple bilinear interpolation model. As I mentioned, I cannot provide my own models due to IP restrictions.
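For concreteness, a bilinear model equivalent to the one I am testing can be written and exported like this (a sketch, not my actual model; the NHWC layout and the IMAGE/OUTPUT names are chosen to match the trtexec shapes below):

```python
import torch
import torch.nn.functional as F

class Bilinear(torch.nn.Module):
    def forward(self, x):              # x is NHWC, e.g. Bx180x320x3
        x = x.permute(0, 3, 1, 2)      # NCHW for interpolate
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return x.permute(0, 2, 3, 1)   # back to NHWC, Bx360x640x3

dummy = torch.randn(1, 180, 320, 3)
torch.onnx.export(Bilinear().eval(), dummy, "bilinear.onnx",
                  opset_version=11,
                  input_names=["IMAGE"], output_names=["OUTPUT"],
                  dynamic_axes={"IMAGE": {0: "B"}, "OUTPUT": {0: "B"}})
```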
Here are example outputs from trtexec, run like: ./trtexec --onnx=bilinear.onnx --minShapes='IMAGE':1x180x320x3,'OUTPUT':1x360x640x3 --optShapes='IMAGE':Bx180x320x3,'OUTPUT':Bx360x640x3 --maxShapes='IMAGE':64x180x320x3,'OUTPUT':64x360x640x3
You'll see the linear increase in latency.
B=1:
[09/21/2021-20:10:07] [I] Average on 10 runs - GPU latency: 0.0435547 ms - Host latency: 0.611108 ms (end to end 0.81272 ms, enqueue 0.0090332 ms)
[09/21/2021-20:10:07] [I] Host Latency
[09/21/2021-20:10:07] [I] min: 0.593262 ms (end to end 0.628418 ms)
[09/21/2021-20:10:07] [I] max: 0.616211 ms (end to end 0.867676 ms)
[09/21/2021-20:10:07] [I] mean: 0.611237 ms (end to end 0.814108 ms)
[09/21/2021-20:10:07] [I] median: 0.611023 ms (end to end 0.814697 ms)
[09/21/2021-20:10:07] [I] percentile: 0.614563 ms at 99% (end to end 0.816162 ms at 99%)
[09/21/2021-20:10:07] [I] throughput: 0 qps
[09/21/2021-20:10:07] [I] walltime: 3.00143 s
[09/21/2021-20:10:07] [I] Enqueue Time
[09/21/2021-20:10:07] [I] min: 0.00805664 ms
[09/21/2021-20:10:07] [I] max: 0.0180664 ms
[09/21/2021-20:10:07] [I] median: 0.0090332 ms
[09/21/2021-20:10:07] [I] GPU Compute
[09/21/2021-20:10:07] [I] min: 0.0415039 ms
[09/21/2021-20:10:07] [I] max: 0.0471191 ms
[09/21/2021-20:10:07] [I] mean: 0.0437881 ms
[09/21/2021-20:10:07] [I] median: 0.043457 ms
[09/21/2021-20:10:07] [I] percentile: 0.046875 ms at 99%
[09/21/2021-20:10:07] [I] total compute time: 0.297453 s
B=2:
[09/21/2021-20:16:37] [I] Average on 10 runs - GPU latency: 0.0808105 ms - Host latency: 1.20684 ms (end to end 1.68406 ms, enqueue 0.00878906 ms)
[09/21/2021-20:16:37] [I] Host Latency
[09/21/2021-20:16:37] [I] min: 1.17065 ms (end to end 1.63037 ms)
[09/21/2021-20:16:37] [I] max: 1.21362 ms (end to end 1.69461 ms)
[09/21/2021-20:16:37] [I] mean: 1.20674 ms (end to end 1.6848 ms)
[09/21/2021-20:16:37] [I] median: 1.20654 ms (end to end 1.6853 ms)
[09/21/2021-20:16:37] [I] percentile: 1.20856 ms at 99% (end to end 1.68738 ms at 99%)
[09/21/2021-20:16:37] [I] throughput: 0 qps
[09/21/2021-20:16:37] [I] walltime: 3.00238 s
[09/21/2021-20:16:37] [I] Enqueue Time
[09/21/2021-20:16:37] [I] min: 0.00756836 ms
[09/21/2021-20:16:37] [I] max: 0.0177612 ms
[09/21/2021-20:16:37] [I] median: 0.00878906 ms
[09/21/2021-20:16:37] [I] GPU Compute
[09/21/2021-20:16:37] [I] min: 0.0787354 ms
[09/21/2021-20:16:37] [I] max: 0.0838013 ms
[09/21/2021-20:16:37] [I] mean: 0.0807835 ms
[09/21/2021-20:16:37] [I] median: 0.0804443 ms
[09/21/2021-20:16:37] [I] percentile: 0.0820923 ms at 99%
[09/21/2021-20:16:37] [I] total compute time: 0.276199 s
B=3:
[09/21/2021-20:17:27] [I] Average on 10 runs - GPU latency: 0.117114 ms - Host latency: 1.80073 ms (end to end 2.55527 ms, enqueue 0.00908203 ms)
[09/21/2021-20:17:27] [I] Host Latency
[09/21/2021-20:17:27] [I] min: 1.74707 ms (end to end 2.5 ms)
[09/21/2021-20:17:27] [I] max: 1.80957 ms (end to end 2.56079 ms)
[09/21/2021-20:17:27] [I] mean: 1.80084 ms (end to end 2.55501 ms)
[09/21/2021-20:17:27] [I] median: 1.80066 ms (end to end 2.55554 ms)
[09/21/2021-20:17:27] [I] percentile: 1.80298 ms at 99% (end to end 2.55713 ms at 99%)
[09/21/2021-20:17:27] [I] throughput: 0 qps
[09/21/2021-20:17:27] [I] walltime: 3.00303 s
[09/21/2021-20:17:27] [I] Enqueue Time
[09/21/2021-20:17:27] [I] min: 0.00805664 ms
[09/21/2021-20:17:27] [I] max: 0.0180664 ms
[09/21/2021-20:17:27] [I] median: 0.00915527 ms
[09/21/2021-20:17:27] [I] GPU Compute
[09/21/2021-20:17:27] [I] min: 0.114868 ms
[09/21/2021-20:17:27] [I] max: 0.120102 ms
[09/21/2021-20:17:27] [I] mean: 0.117101 ms
[09/21/2021-20:17:27] [I] median: 0.116943 ms
[09/21/2021-20:17:27] [I] percentile: 0.118896 ms at 99%
[09/21/2021-20:17:27] [I] total compute time: 0.267577 s
B=4:
[09/21/2021-20:17:57] [I] Average on 10 runs - GPU latency: 0.159546 ms - Host latency: 2.4011 ms (end to end 3.42803 ms, enqueue 0.0090332 ms)
[09/21/2021-20:17:57] [I] Host Latency
[09/21/2021-20:17:57] [I] min: 2.33008 ms (end to end 3.35645 ms)
[09/21/2021-20:17:57] [I] max: 2.40906 ms (end to end 3.43738 ms)
[09/21/2021-20:17:57] [I] mean: 2.40099 ms (end to end 3.42803 ms)
[09/21/2021-20:17:57] [I] median: 2.40112 ms (end to end 3.42847 ms)
[09/21/2021-20:17:57] [I] percentile: 2.40332 ms at 99% (end to end 3.43005 ms at 99%)
[09/21/2021-20:17:57] [I] throughput: 0 qps
[09/21/2021-20:17:57] [I] walltime: 3.0056 s
[09/21/2021-20:17:57] [I] Enqueue Time
[09/21/2021-20:17:57] [I] min: 0.00805664 ms
[09/21/2021-20:17:57] [I] max: 0.0178833 ms
[09/21/2021-20:17:57] [I] median: 0.00915527 ms
[09/21/2021-20:17:57] [I] GPU Compute
[09/21/2021-20:17:57] [I] min: 0.157593 ms
[09/21/2021-20:17:57] [I] max: 0.162079 ms
[09/21/2021-20:17:57] [I] mean: 0.159489 ms
[09/21/2021-20:17:57] [I] median: 0.159668 ms
[09/21/2021-20:17:57] [I] percentile: 0.161682 ms at 99%
[09/21/2021-20:17:57] [I] total compute time: 0.273842 s
B=8:
[09/21/2021-20:18:30] [I] Average on 10 runs - GPU latency: 0.312866 ms - Host latency: 4.78594 ms (end to end 6.91038 ms, enqueue 0.00964355 ms)
[09/21/2021-20:18:30] [I] Host Latency
[09/21/2021-20:18:30] [I] min: 4.64575 ms (end to end 6.76855 ms)
[09/21/2021-20:18:30] [I] max: 4.79071 ms (end to end 6.91711 ms)
[09/21/2021-20:18:30] [I] mean: 4.78564 ms (end to end 6.91054 ms)
[09/21/2021-20:18:30] [I] median: 4.78589 ms (end to end 6.91125 ms)
[09/21/2021-20:18:30] [I] percentile: 4.78845 ms at 99% (end to end 6.91321 ms at 99%)
[09/21/2021-20:18:30] [I] throughput: 0 qps
[09/21/2021-20:18:30] [I] walltime: 3.01061 s
[09/21/2021-20:18:30] [I] Enqueue Time
[09/21/2021-20:18:30] [I] min: 0.00842285 ms
[09/21/2021-20:18:30] [I] max: 0.0207214 ms
[09/21/2021-20:18:30] [I] median: 0.00952148 ms
[09/21/2021-20:18:30] [I] GPU Compute
[09/21/2021-20:18:30] [I] min: 0.309448 ms
[09/21/2021-20:18:30] [I] max: 0.317429 ms
[09/21/2021-20:18:30] [I] mean: 0.312851 ms
[09/21/2021-20:18:30] [I] median: 0.31308 ms
[09/21/2021-20:18:30] [I] percentile: 0.31543 ms at 99%
[09/21/2021-20:18:30] [I] total compute time: 0.269365 s
B=64:
[09/21/2021-20:19:19] [I] Average on 10 runs - GPU latency: 2.46104 ms - Host latency: 38.059 ms (end to end 55.6042 ms, enqueue 0.0102051 ms)
[09/21/2021-20:19:19] [I] Host Latency
[09/21/2021-20:19:19] [I] min: 37.0532 ms (end to end 54.5974 ms)
[09/21/2021-20:19:19] [I] max: 38.1917 ms (end to end 55.7484 ms)
[09/21/2021-20:19:19] [I] mean: 38.1627 ms (end to end 55.7096 ms)
[09/21/2021-20:19:19] [I] median: 38.1709 ms (end to end 55.7176 ms)
[09/21/2021-20:19:19] [I] percentile: 38.1875 ms at 99% (end to end 55.7404 ms at 99%)
[09/21/2021-20:19:19] [I] throughput: 0 qps
[09/21/2021-20:19:19] [I] walltime: 3.09588 s
[09/21/2021-20:19:19] [I] Enqueue Time
[09/21/2021-20:19:19] [I] min: 0.00939941 ms
[09/21/2021-20:19:19] [I] max: 0.0240479 ms
[09/21/2021-20:19:19] [I] median: 0.0102539 ms
[09/21/2021-20:19:19] [I] GPU Compute
[09/21/2021-20:19:19] [I] min: 2.45557 ms
[09/21/2021-20:19:19] [I] max: 2.46692 ms
[09/21/2021-20:19:19] [I] mean: 2.46089 ms
[09/21/2021-20:19:19] [I] median: 2.46118 ms
[09/21/2021-20:19:19] [I] percentile: 2.46533 ms at 99%
[09/21/2021-20:19:19] [I] total compute time: 0.270698 s
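For what it's worth, dividing the mean GPU Compute numbers above by batch size shows the per-sample compute cost is essentially flat, i.e. total latency grows linearly with B:

```python
# Mean "GPU Compute" latency (ms) from the trtexec runs above.
gpu_ms = {1: 0.0437881, 2: 0.0807835, 3: 0.117101,
          4: 0.159489, 8: 0.312851, 64: 2.46089}
for b, ms in gpu_ms.items():
    print(f"B={b:2d}: {ms:.4f} ms total, {ms / b:.4f} ms per sample")
```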
Our team is looking into this issue. Could you please elaborate on the statements below?
Model runs batch-agnostic and 10x faster when using the torch2trt TRTModule.
Can TensorRT models be made batch-size agnostic? What about concurrency-agnostic?
What does “batch-size agnostic” mean in this context? Does it mean building an engine with minShapes == optShapes == maxShapes? Or something else?
Batch-agnostic means the same latency as batch size increases: as long as there is GPU compute available, the model should run a forward pass in the same amount of time regardless of batch size. For example, this is how PyTorch behaves and how TRT models behave under torch2trt.
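To illustrate what I mean, here is a minimal timing sketch with CUDA events (a bilinear upsample stands in for my model):

```python
import torch

model = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False).cuda()
for b in (1, 2, 3, 4, 8, 64):
    x = torch.randn(b, 3, 180, 320, device="cuda")
    for _ in range(10):  # warm-up
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    model(x)
    end.record()
    torch.cuda.synchronize()
    print(f"batch {b}: {start.elapsed_time(end):.3f} ms")
```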
He asked for torch2trt reproduction code and a model. I'm attaching a zip which contains write_bilinear.py, which writes the following for the bilinear model I shared above:
a TensorRT .engine file
a torch2trt “TRTModule” .pth file
an .onnx file
It also contains all of those models (.engine, .pth, .onnx) as created on my machine: torch==1.9.0, nvidia-tensorrt==7.2.3.4, torch2trt==0.3.0.
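This is roughly what write_bilinear.py does (a condensed sketch, assuming torch2trt's interpolate converter handles the upsample; the zip has the exact version, and the file names here are illustrative):

```python
import torch
from torch2trt import torch2trt, TRTModule

model = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False).eval().cuda()
x = torch.randn(1, 3, 180, 320, device="cuda")

model_trt = torch2trt(model, [x], max_batch_size=64)    # build the TRT engine
torch.save(model_trt.state_dict(), "bilinear_trt.pth")  # torch2trt TRTModule weights
with open("bilinear.engine", "wb") as f:
    f.write(model_trt.engine.serialize())               # raw TensorRT engine

# Reload for inference later:
model_trt = TRTModule()
model_trt.load_state_dict(torch.load("bilinear_trt.pth"))
```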
It will also profile the TRTModule inference latency, averaged over 1000 runs. This is the program output on my T4:
batch size 1: 0.057 ms
batch size 2: 0.057 ms
batch size 3: 0.057 ms
batch size 4: 0.057 ms
batch size 8: 0.057 ms
batch size 64: 0.058 ms