TRTIS Tesla M60 performance issues (TensorRT model)

Hello,

I am running a TensorRT model on Tesla M60 cards (Amazon g3.16xlarge instance).
I am seeing odd TRTIS behavior: the model running on 4 GPUs shows only about a 25% FPS improvement over the same model running on 2 GPUs.
Does anyone know what the bottleneck could be here?

I checked the same model on GTX 1070 cards, and there the performance doubles when the number of GPUs is doubled.

What version of TRTIS are you using?
The M60 is compute capability 5.2, so it is not an officially supported device. But given that your model runs correctly on a single M60 (I assume it does), that is likely not the issue.

When you say “runs on 4 GPUs” and “runs on 2 GPUs”, do you mean that you create 4 model instances, one on each GPU, and compare that to 2 model instances, one on each of 2 GPUs?
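For context, in TRTIS that placement is controlled by the instance_group section of the model's config.pbtxt; a minimal sketch (the GPU indices here are just illustrative):

    instance_group [
      {
        count: 1        # instances created on each GPU listed in "gpus"
        kind: KIND_GPU
        gpus: [ 0, 1 ]  # one instance on GPU 0 and one on GPU 1
      }
    ]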

I am using TRTIS from nvcr.io/nvidia/tensorrtserver:19.04-py3 (the 19.04 release is based on NVIDIA TensorRT Inference Server 1.1.0).

By running on X GPUs I mean that the same number of model instances is created per GPU.
I compared 2 vs. 4 GPUs with first 1 model instance per GPU and then 3 model instances per GPU.
The FPS results were:

1 instance & 2 GPUs - 4.2
1 instance & 4 GPUs - 6.6

3 instances & 2 GPUs - 5.8
3 instances & 4 GPUs - 6.9

And what are the officially supported devices?
I could not find this information in the docs…

Release notes (GPU requirements section): Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation
Support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html

You can find links to both in the Documentation section of the GitHub README.

It is difficult to say with certainty what is causing your (apparent) performance bottleneck. It could be PCIe bandwidth, CPU, etc.
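One way to narrow it down is to watch per-GPU utilization while your benchmark runs: if the M60s sit well below 100%, the server is starving them (which points at PCIe bandwidth, CPU, or the request pipeline), whereas pegged GPUs would mean you are compute-bound. A minimal monitoring sketch using the pynvml bindings (a separate pip install, not part of TRTIS):

    # Print per-GPU SM and memory-controller utilization once per second.
    import time
    import pynvml

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    try:
        while True:
            rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
            print(" | ".join(f"GPU{i}: {r.gpu:3d}% sm {r.memory:3d}% mem"
                             for i, r in enumerate(rates)))
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()

Watching nvidia-smi while the benchmark runs would of course tell you the same thing.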