I’m running TensorRT Inference Server (TRTIS) on a Lambda Blade (https://lambdalabs.com/products/blade). It has 8 Titan RTX GPUs, but I am only seeing a marginal improvement in performance over a single Titan RTX: one GPU processes approximately 1.1 batches/second, whereas eight GPUs process approximately 1.5 batches/second.
I have 40 preprocessing workers, each running in its own Docker container on the same machine as the TRTIS container. Each worker prepares a batch of 270 images and sends it to TRTIS over gRPC (a sketch of the client call is below).
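For context, each worker's inference call is roughly equivalent to the sketch below. The model name, tensor names, and image shape are placeholders, and I'm assuming the tensorrtserver Python client library that ships with the 19.xx releases:

```python
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# One preprocessed batch of 270 images (shape/dtype are placeholders).
batch = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(270)]

# Connect to the TRTIS gRPC endpoint (default gRPC port is 8001).
ctx = InferContext("localhost:8001",
                   ProtocolType.from_str("grpc"),
                   "my_model",        # placeholder model name
                   -1,                # -1 = latest model version
                   verbose=False)

# Send the whole batch as a single request; "input" and "output" are
# placeholder tensor names from the model's config.pbtxt.
result = ctx.run({"input": batch},
                 {"output": InferContext.ResultFormat.RAW},
                 batch_size=len(batch))
```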
When using one GPU, that GPU is almost constantly at 100% utilization. When using eight GPUs, only one or two are loaded at any given moment; the rest sit at 0%.
I am currently using the TRTIS 19.08 container (nvcr.io/nvidia/tensorrtserver:19.08-py3), but I saw the same results with 19.07 and 19.06.
TRTIS is launched with docker-compose using the following options (a sketch of the compose file follows the list):
- --grpc-infer-thread-count=64
- --grpc-stream-infer-thread-count=64
- shm_size: 2g
- memlock: -1
- stack: 67108864
- network_mode: host
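For completeness, the relevant part of the compose file looks roughly like this. The compose file version, service name, and model repository path are placeholders; the options match the list above:

```yaml
version: "2.3"   # 2.x format so the nvidia runtime can be selected via "runtime"
services:
  trtis:
    image: nvcr.io/nvidia/tensorrtserver:19.08-py3
    runtime: nvidia
    network_mode: host
    shm_size: 2g
    ulimits:
      memlock: -1
      stack: 67108864
    volumes:
      - /path/to/model_repository:/models   # placeholder path
    command: >
      trtserver --model-repository=/models
                --grpc-infer-thread-count=64
                --grpc-stream-infer-thread-count=64
```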
From the system’s metrics (with 8 GPUs):
- Average inference round-trip time experienced by each preprocessing worker: 13 s
- TRTIS queue time: max 15 ms, usually under 10 ms
- TRTIS compute time: max 1.5 s, usually around 1 s
System specifications:
- Ubuntu 18.04
- 512 GB RAM
- 8 X Titan RTX
- 2 X Intel Xeon Gold 6230 (20 cores/40 threads each)
- Docker version 18.09.8, build 0dd43dd87f (I'm using a slightly older version of Docker with nvidia-docker2 so that I can use the nvidia runtime from docker-compose)
Is there a TRTIS internal worker count or some other setting that I am overlooking?