TRTIS Tesla M60 performance issues (TensorRT model)

Hello,

I am running a TensorRT model on Tesla M60 cards (Amazon g3.16xlarge instance).
I am seeing odd TRTIS behavior: the model running on 4 GPUs shows only about a 25% FPS improvement over the same model running on 2 GPUs.
Does anyone know what the bottleneck could be here?

I checked the same model on GTX 1070 cards, and there the performance doubles when the number of GPUs is doubled.

What version of TRTIS are you using?
The M60 is compute capability 5.2, so it is not an officially supported device. But given that your model runs correctly on a single M60 (I assume it does), that is likely not the issue.

When you say “runs on 4 GPUs” and “runs on 2 GPUs”, do you mean that you create 4 model instances, one on each GPU, and compare that to 2 model instances, one on each of 2 GPUs?
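For context, in TRTIS that placement is controlled by the instance_group section of the model's config.pbtxt; a minimal sketch (the GPU indices here are just illustrative):

    instance_group [
      {
        count: 1        # instances created on each GPU listed in "gpus"
        kind: KIND_GPU
        gpus: [ 0, 1 ]  # one instance on GPU 0 and one on GPU 1
      }
    ]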

I am using TRTIS from nvcr.io/nvidia/tensorrtserver:19.04-py3 (the 19.04 release is based on NVIDIA TensorRT Inference Server 1.1.0).

By running on X GPUs I mean that the same number of model instances is created per GPU.
I compared 2 vs. 4 GPUs with first 1 model instance per GPU and then 3 model instances per GPU.
The FPS results were:

1 instance & 2 GPUs - 4.2
1 instance & 4 GPUs - 6.6

3 instances & 2 GPUs - 5.8
3 instances & 4 GPUs - 6.9

And what are the officially supported devices?
I could not find this information in the docs…

Release notes (GPU requirements section): Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation
Support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html

You can find links to both in the Documentation section of the GitHub README.

It is difficult to say with certainty what is causing your (apparent) performance bottleneck. It could be PCIe bandwidth, CPU, etc.
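One way to narrow it down is to watch per-GPU utilization while your benchmark runs: if the M60s sit well below 100%, the server is starving them (which points at PCIe bandwidth, CPU, or the request pipeline), whereas pegged GPUs would mean you are compute-bound. A minimal monitoring sketch using the pynvml bindings (a separate pip install, not part of TRTIS):

    # Print per-GPU SM and memory-controller utilization once per second.
    import time
    import pynvml

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    try:
        while True:
            rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
            print(" | ".join(f"GPU{i}: {r.gpu:3d}% sm {r.memory:3d}% mem"
                             for i, r in enumerate(rates)))
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()

Watching nvidia-smi while the benchmark runs would of course tell you the same thing.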