I’m running TensorRT Inference Server (TRTIS) on a Lambda Blade (https://lambdalabs.com/products/blade). It has 8 Titan RTX GPUs, but I am only seeing a marginal improvement in performance over a single Titan RTX: one GPU processes approximately 1.1 batches/second, whereas eight GPUs process approximately 1.5 batches/second.
I have 40 preprocessing workers, each running in its own Docker container on the same machine as the TRTIS container. Each worker prepares a batch of 270 images and sends it to TRTIS over gRPC (a sketch of the client call is below).
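For context, each worker's inference call is roughly equivalent to the sketch below. The model name, tensor names, and image shape are placeholders, and I'm assuming the tensorrtserver Python client library that ships with the 19.xx releases:

```python
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# One preprocessed batch of 270 images (shape/dtype are placeholders).
batch = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(270)]

# Connect to the TRTIS gRPC endpoint (default gRPC port is 8001).
ctx = InferContext("localhost:8001",
                   ProtocolType.from_str("grpc"),
                   "my_model",        # placeholder model name
                   -1,                # -1 = latest model version
                   verbose=False)

# Send the whole batch as a single request; "input" and "output" are
# placeholder tensor names from the model's config.pbtxt.
result = ctx.run({"input": batch},
                 {"output": InferContext.ResultFormat.RAW},
                 batch_size=len(batch))
```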
When using one GPU, that GPU is almost constantly at 100% utilization. When using eight GPUs, only one or two are loaded at any given moment; the rest sit at 0%.
I am currently using the TRTIS 19.08 container (nvcr.io/nvidia/tensorrtserver:19.08-py3), but I saw the same results with 19.07 and 19.06.
TRTIS is launched with docker-compose using the following options (a sketch of the compose file follows the list):
- --grpc-infer-thread-count=64
- --grpc-stream-infer-thread-count=64
- shm_size: 2g
- memlock: -1
- stack: 67108864
- network_mode: host
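For completeness, the relevant part of the compose file looks roughly like this. The compose file version, service name, and model repository path are placeholders; the options match the list above:

```yaml
version: "2.3"   # 2.x format so the nvidia runtime can be selected via "runtime"
services:
  trtis:
    image: nvcr.io/nvidia/tensorrtserver:19.08-py3
    runtime: nvidia
    network_mode: host
    shm_size: 2g
    ulimits:
      memlock: -1
      stack: 67108864
    volumes:
      - /path/to/model_repository:/models   # placeholder path
    command: >
      trtserver --model-repository=/models
                --grpc-infer-thread-count=64
                --grpc-stream-infer-thread-count=64
```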
From the system’s metrics (with 8 GPUs):
- Average inference round-trip time experienced by each preprocessing worker: 13 s
- TRTIS queue time: max 15 ms, usually under 10 ms
- TRTIS compute time: max 1.5 s, usually around 1 s
System specifications:
- Ubuntu 18.04
- 512 GB RAM
- 8 X Titan RTX
- 2 X Intel Xeon Gold 6230 (20 cores/40 threads each)
- Docker version 18.09.8, build 0dd43dd87f (I'm using a slightly older version of Docker with nvidia-docker2 so that I can use the nvidia runtime from docker-compose)
Is there a TRTIS internal worker count or some other setting that I am overlooking?