TensorRT Inference Server: low performance with 8 GPUs

I’m running TensorRT Inference Server on a Lambda Blade (https://lambdalabs.com/products/blade). It has 8 Titan RTX GPUs, but I am only getting a marginal performance improvement over a single Titan RTX: one GPU processes approx. 1.1 batches/second, whereas eight GPUs process approx. 1.5 batches/second.

I have 40 preprocessing workers, each running in its own Docker container on the same machine as the TRTIS container. Each worker prepares a batch of 270 images, which is then sent to TRTIS over gRPC, roughly as sketched below.
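Each worker submits its batch along these lines, using the tensorrtserver Python client that ships with the container (a minimal sketch: the model name, tensor names, and image shape are placeholders, not the actual configuration):

    import numpy as np
    from tensorrtserver.api import InferContext, ProtocolType

    # Connect to the TRTIS gRPC endpoint (port 8001 by default).
    protocol = ProtocolType.from_str("grpc")
    ctx = InferContext("localhost:8001", protocol, "my_model")  # "my_model" is a placeholder

    # One preprocessed batch of 270 images (the CHW shape is a placeholder).
    batch = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(270)]

    # Send the whole batch as a single gRPC inference request.
    results = ctx.run(
        {"input": batch},                           # input tensor name: placeholder
        {"output": InferContext.ResultFormat.RAW},  # output tensor name: placeholder
        batch_size=len(batch))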

When using one GPU, it is pretty much constantly maxed out. When using eight GPUs, each GPU is only loaded intermittently (usually one or two at a time); the rest sit at 0% utilization.

I am currently using the TRTIS 19.08 container (nvcr.io/nvidia/tensorrtserver:19.08-py3), but I observed the same results with 19.07 and 19.06.

TRTIS is launched with docker-compose using the following options (a sketch of the compose file follows the list):

  • --grpc-infer-thread-count=64
  • --grpc-stream-infer-thread-count=64
  • shm_size: 2g
  • memlock: -1
  • stack: 67108864
  • network_mode: host
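For reference, the relevant part of the compose file looks roughly like this (the model repository path and the --model-store value are placeholders for the actual setup):

    version: "2.3"              # 2.x supports the runtime key with nvidia-docker2
    services:
      trtis:
        image: nvcr.io/nvidia/tensorrtserver:19.08-py3
        runtime: nvidia
        network_mode: host
        shm_size: 2g
        ulimits:
          memlock: -1
          stack: 67108864
        volumes:
          - /data/models:/models   # placeholder model repository path
        command: >
          trtserver --model-store=/models
                    --grpc-infer-thread-count=64
                    --grpc-stream-infer-thread-count=64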

From the system’s metrics (for 8 GPUs):

  • Average inference round-trip time experienced by each preprocessing worker: 13 s
  • TRTIS queue time: max 15 ms, usually less than 10 ms
  • TRTIS compute time: max 1.5 s, usually around 1 s

In other words, almost all of the 13 s round trip is spent outside TRTIS’s queue and compute phases, which points at transport or request dispatch rather than the GPUs themselves.

System specifications:

  • Ubuntu 18.04
  • 512 GB RAM
  • 8 X Titan RTX
  • 2 X Intel Xeon Gold 6230 (20 cores/40 threads each)
  • Docker version 18.09.8, build 0dd43dd87f (I'm using a slightly older version of Docker with nvidia-docker2 so that I can use the nvidia runtime in docker-compose)

Is there an internal TRTIS worker count or some other setting that I am overlooking?

Thanks for the report. We are looking into multi-GPU performance right now to see if there is a problem. We also fixed a significant gRPC issue that could be limiting your performance; the fix will be in the 19.09 release (or you can build the server from master to try it now).

As a workaround (WAR) you can try launching eight TRTIS processes, each of which sees a single GPU (set NVIDIA_VISIBLE_DEVICES in your Docker args, or CUDA_VISIBLE_DEVICES within the container), and see if that provides better scaling; a sketch of such a setup is below.
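For example, with docker-compose that could look roughly like this (not a tested config: only GPUs 0 and 1 are shown, the model repository path is a placeholder, and with host networking each instance needs its own HTTP/gRPC/metrics ports):

    version: "2.3"
    services:
      trtis_gpu0:
        image: nvcr.io/nvidia/tensorrtserver:19.08-py3
        runtime: nvidia
        network_mode: host
        shm_size: 2g
        ulimits:
          memlock: -1
          stack: 67108864
        environment:
          - NVIDIA_VISIBLE_DEVICES=0   # this instance only sees GPU 0
        volumes:
          - /data/models:/models       # placeholder model repository path
        command: >
          trtserver --model-store=/models
                    --http-port=8000 --grpc-port=8001 --metrics-port=8002
      trtis_gpu1:
        image: nvcr.io/nvidia/tensorrtserver:19.08-py3
        runtime: nvidia
        network_mode: host
        shm_size: 2g
        ulimits:
          memlock: -1
          stack: 67108864
        environment:
          - NVIDIA_VISIBLE_DEVICES=1   # this instance only sees GPU 1
        volumes:
          - /data/models:/models
        command: >
          trtserver --model-store=/models
                    --http-port=8010 --grpc-port=8011 --metrics-port=8012
      # ...repeat for trtis_gpu2 through trtis_gpu7 with distinct ports...

The preprocessing workers can then be sharded across the eight gRPC endpoints (ports 8001, 8011, and so on).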

Thank you @David Goodwin. We’ll run multiple TRTIS processes for now.