Using an NVLink bridge makes a big impact on training speed on 2x RTX 2080 (multi-GPU training with P2P)

I thought it would be nice to share my experience with installing an NVLink bridge.

We have a server with 8x RTX 2080 Ti cards. I was experimenting with how well training scales as the number of GPUs increases. The results I found on the internet are reported with the batch size increased along with the number of GPUs: if the batch size is 64 on 1 GPU, it is 128 on 2 GPUs, and so on. However, in DNN training large batch sizes do not always give the best results, so I was interested in keeping the batch size at 64 while distributing the calculations across several GPUs. The results depend greatly on the DNN model used: the bigger the model, the smaller the advantage of running across multiple GPUs. The DNN framework also plays a very big part in the speed.
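To make the fixed-batch setup concrete, here is a minimal sketch of how a global batch of 64 shrinks per GPU as cards are added. The helper name split_batch is hypothetical and not part of any framework; the frameworks do their own splitting internally.

```python
def split_batch(global_batch, num_gpus):
    """Return per-GPU batch sizes that add up to global_batch."""
    base, extra = divmod(global_batch, num_gpus)
    # The first `extra` GPUs take one additional sample each.
    return [base + 1 if i < extra else base for i in range(num_gpus)]


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} GPU(s): per-GPU batches {split_batch(64, n)}")
```

With 8 GPUs each card only sees 8 samples per step, so the compute per step shrinks while the amount of gradient data to exchange stays the same. That is one reason fixed-batch scaling is harder than the scaled-batch numbers usually reported.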

In my experiments I found that the NVIDIA branch of Caffe and PyTorch give the best results (beating TensorFlow and MXNet). This is the expected result and is confirmed by many publications on the internet.

What I did not expect is that using an NVLink bridge would make a VERY significant impact. Here are the results using nvcaffe:
single GPU: 450 images/second
dual GPU via single PCIe switch: 535 images/second
dual GPU via NVLink (enabling P2P): 830 images/second
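For reference, the figures above can be turned into speedup and scaling efficiency with a few lines of Python. This is just arithmetic on the three reported numbers; the function name scaling is made up for this sketch.

```python
def scaling(throughput, baseline, num_gpus):
    """Return (speedup over one GPU, efficiency per GPU)."""
    speedup = throughput / baseline
    return speedup, speedup / num_gpus


if __name__ == "__main__":
    baseline = 450  # images/second on a single GPU
    runs = {
        "dual GPU via PCIe switch": 535,
        "dual GPU via NVLink (P2P)": 830,
    }
    for name, images_per_second in runs.items():
        speedup, efficiency = scaling(images_per_second, baseline, 2)
        print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

That works out to roughly 1.19x over PCIe versus 1.84x over NVLink for the same two cards.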

The model I train has a massive last layer, on the order of 200K-300K outputs, so I believe a lot of data needs to be copied between GPUs, which is why a fast link makes such an impact.
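A rough back-of-the-envelope sketch of how much gradient data such a layer produces per iteration. The input width of 1024 and the float32 storage are assumptions made purely for illustration (the real layer dimensions are not given above); only the 200K-300K output range comes from the post.

```python
def layer_gradient_bytes(in_features, out_features, bytes_per_value=4):
    """Size of the gradient of a fully connected layer (weights plus biases)."""
    return (in_features * out_features + out_features) * bytes_per_value


if __name__ == "__main__":
    in_features = 1024  # ASSUMED input width, not taken from the post
    for out_features in (200_000, 250_000, 300_000):
        size_gb = layer_gradient_bytes(in_features, out_features) / 1e9
        print(f"{out_features} outputs: about {size_gb:.2f} GB of gradients per iteration")
```

Under these assumptions the last layer alone would account for around a gigabyte of gradients that data-parallel training has to exchange on every step, which is consistent with the link speed mattering so much.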



Do you have a simple example of PyTorch code that actually uses NVLink? For example, looking at this simple demo code, what needs to change to use NVLink? (assuming multiple cards are available, and linked via NVLink)

Hi Andrei,

The experiments I did with P2P were for the Caffe framework. Basically, Caffe was built using libraries that can benefit from fast data exchange between GPU cards via NVLink (P2P). There is nothing specific you have to do in the protobuf configuration files to enable this in Caffe; it is all done behind the scenes.

With PyTorch I would assume the same is true. Namely, if PyTorch knows how to exchange data directly between GPU cards (without copying to CPU memory), then everything should work without the need for explicit Python statements. I think P2P data exchange is enabled by default in all modern builds of PyTorch.
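If you want to check rather than assume, here is a minimal sketch that asks PyTorch which GPU pairs report peer access. It relies on torch.cuda.device_count and torch.cuda.can_device_access_peer; the helper peer_access_table is a made-up name for this example. Note that peer access can also be reported over plain PCIe, so a "yes" here does not by itself prove the traffic goes over NVLink; running nvidia-smi topo -m shows how the cards are actually connected.

```python
def peer_access_table(num_devices, can_access):
    """Build a matrix of peer-access flags; the diagonal is None."""
    table = []
    for i in range(num_devices):
        row = []
        for j in range(num_devices):
            row.append(None if i == j else bool(can_access(i, j)))
        table.append(row)
    return table


def main():
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        print("PyTorch is not installed; nothing to check.")
        return
    n = torch.cuda.device_count()
    if n < 2:
        print(f"Found {n} CUDA device(s); P2P needs at least two.")
        return
    table = peer_access_table(n, torch.cuda.can_device_access_peer)
    for i, row in enumerate(table):
        for j, ok in enumerate(row):
            if ok is not None:
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")


if __name__ == "__main__":
    main()
```

If every pair reports peer access, the training script itself should not need any NVLink-specific changes.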

Here are some useful links for you

Hope this helps,