megan1
September 14, 2021, 5:06pm
1
Description
We are running a TRT model with the tensorrt_plan platform on Triton Inference Server. When using perf_analyzer, we see a linear increase in latency as we increase the batch size:
batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec
We also see a near-linear increase in latency when we add concurrency:
Concurrency: 1, throughput: 51.4 infer/sec, latency 20192 usec
Concurrency: 2, throughput: 60.4 infer/sec, latency 33602 usec
Concurrency: 3, throughput: 59.8 infer/sec, latency 51003 usec
Concurrency: 4, throughput: 59 infer/sec, latency 68353 usec
I have increased the max batch size to 8G; there is no change in the outcome.
The model runs batch-agnostic and 10x faster when using the torch2trt TRTModule.
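For reference, a minimal sketch of that kind of direct torch2trt comparison (the small convolutional model below is only a hypothetical stand-in for the auto-encoder, and the max_batch_size/fp16_mode arguments and loop counts are illustrative, not the exact settings used):

import time
import torch
from torch2trt import torch2trt

# Hypothetical stand-in for the auto-encoder (the real model cannot be shared)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
).cuda().eval()

# Convert once with an example input; max_batch_size allows larger batches at runtime
x = torch.randn(1, 3, 360, 640).cuda()
model_trt = torch2trt(model, [x], max_batch_size=8, fp16_mode=False)

with torch.no_grad():
    for bs in (1, 2, 3):
        batch = torch.randn(bs, 3, 360, 640).cuda()
        model_trt(batch)                       # warm-up
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(20):
            model_trt(batch)
        torch.cuda.synchronize()
        print(bs, (time.time() - start) / 20)  # mean latency per call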
Similar issues without a non-generic answer:
Hi all,
I encounter the following issue: increasing the batch size leads to a proportional increase in latency.
I’m using TRT 5.1.5.0, C++ API, and converted the network from UFF.
Inference times:
Batch size 1: 12.7ms
Batch size 2: 25.2ms
Batch size 3: 37.5ms
However, the SDK documentation implies that increasing the batch size should not have a large impact on latency. The documentation states: Often the time taken to compute results for batch size N=1 is almost identical to batch sizes…
Hello, for TensorRT serving, my config.pbtxt is:
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 10
input [
{
name: "input_images"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 1376, 800 ]
}
]
output [
{
name: "feature_fusion/Conv_7/Sigmoid"
data_type: TYPE_FP32
dims: [ 344, 200, 1]
}
]
instance_group [
{
kind: KIND_GPU,
count: 1
}
]
and when I use
build/perf_client -m my_model -d -c10 -l2000 -p1000 -b1 -v
to …
Can TensorRT models be made batch-size agnostic? What about concurrency-agnostic?
Environment
TensorRT Version: 7.2.3.4
GPU Type: T4 (same issue occurs on V100)
Nvidia Driver Version: 450.119.03
CUDA Version: 11.3
Operating System + Version: Ubuntu 18.04
Container (if container, which image + tag): tritonserver:21.05-py3
Relevant Files
We are running an auto-encoder model on 360x640 images; we cannot share it here due to IP.
Steps To Reproduce
Export the model to TensorRT (using onnx2trt or torch2trt; same outcome with both) with a dynamic batch size (see the sketch after these steps)
Use the Triton tensorrt_plan platform
Run perf_analyzer on the TRT model
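A minimal sketch of the first step with the TensorRT 7 Python API, for reference (the ONNX path, the input tensor name "input", the 3x360x640 shape, and the min/opt/max batch choices are assumptions; onnx2trt and torch2trt wrap similar builder calls):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_dynamic_batch_engine(onnx_path, plan_path, max_batch=8):
    builder = trt.Builder(TRT_LOGGER)
    # Explicit batch is required for dynamic shapes
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB; adjust as needed

    # One optimization profile covering batch sizes 1..max_batch
    # ("input" and 3x360x640 are placeholders for the real input name/shape)
    profile = builder.create_optimization_profile()
    profile.set_shape("input",
                      (1, 3, 360, 640),
                      (max_batch // 2, 3, 360, 640),
                      (max_batch, 3, 360, 640))
    config.add_optimization_profile(profile)

    engine = builder.build_engine(network, config)
    with open(plan_path, "wb") as f:
        f.write(engine.serialize())
    return engine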
megan1
September 14, 2021, 5:52pm
2
The issue persists when running TensorRT outside of Triton. When I run TensorRT inference directly in Python using tensorrt and pycuda, the inference times are:
Batch size 1: 0.036710262298583984
Batch size 2: 0.06481266021728516
Batch size 3: 0.09190487861633301
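For reference, a standalone measurement along those lines could look roughly like the sketch below, using the TensorRT 7 Python API with pycuda (the plan path, binding indices, and the assumption of a single FP32 input and output are illustrative, not the exact script used):

import time
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates and activates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(plan_path):
    # Deserialize a prebuilt .plan engine file
    with open(plan_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def time_batch(engine, batch_size, chw=(3, 360, 640), n_runs=20):
    context = engine.create_execution_context()
    # Explicit-batch engine: set the dynamic batch dimension on binding 0 (assumed input)
    context.set_binding_shape(0, (batch_size, *chw))
    inp = np.random.rand(batch_size, *chw).astype(np.float32)
    out = np.empty(tuple(context.get_binding_shape(1)), dtype=np.float32)  # assumed single output
    d_inp = cuda.mem_alloc(inp.nbytes)
    d_out = cuda.mem_alloc(out.nbytes)
    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_inp, inp, stream)
    # Warm-up run so lazy initialization is not timed
    context.execute_async_v2([int(d_inp), int(d_out)], stream.handle)
    stream.synchronize()
    start = time.time()
    for _ in range(n_runs):
        context.execute_async_v2([int(d_inp), int(d_out)], stream.handle)
    stream.synchronize()
    elapsed = (time.time() - start) / n_runs
    cuda.memcpy_dtoh_async(out, d_out, stream)
    stream.synchronize()
    return elapsed

if __name__ == "__main__":
    engine = load_engine("model.plan")  # hypothetical path
    for bs in (1, 2, 3):
        print(bs, time_batch(engine, bs))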
megan1
September 24, 2021, 7:43pm
3
Solved the concurrency issue by properly using instance groups on a small model. Batching is still an issue, but it is related to TensorRT, not Triton. Marking as solved.
nadeemm
Closed
October 1, 2021, 3:00pm
4
This topic was automatically closed after 6 days. New replies are no longer allowed.