I thought it would be nice to share my experience with installing an NVLink bridge.
We have a server with 8x RTX 2080 Ti cards. I was experimenting with how well training scales as the number of GPUs increases. The results I found on the internet are reported with the batch size scaled along with the number of GPUs: with 1 GPU the batch size is 64, with 2 it is 128, and so on. However, in DNN training big batch sizes do not always bring the best results, so I was interested in keeping the batch size at 64 while distributing the computation across several GPUs. The results depend greatly on the DNN model used: the bigger the model, the less advantage there is in running across multiple GPUs. The DNN framework also plays a very big part in the speed.
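To make the difference concrete, here is a minimal sketch (in plain Python, with illustrative helper names of my own) of the two scaling regimes: the common "weak scaling" setup where the global batch grows with the GPU count, versus the fixed-global-batch setup I was measuring, where each GPU gets a smaller slice:

```python
# Sketch: growing ("weak scaling") vs. fixed ("strong scaling") global batch.
# The global batch of 64 matches my experiments; per_gpu_batch is just an
# illustrative helper, not part of any framework.

def per_gpu_batch(global_batch, num_gpus):
    """Per-GPU batch when the global batch is kept fixed."""
    assert global_batch % num_gpus == 0
    return global_batch // num_gpus

for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): weak scaling = 64 per GPU (global {64 * n}), "
          f"fixed global batch = {per_gpu_batch(64, n)} per GPU")
```

With the fixed global batch each GPU does less compute per step, so the per-step communication cost (gradient exchange) takes a proportionally bigger share of the time, which is why the interconnect matters so much here.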
In my experiments I found that the NVIDIA branch of Caffe (nvcaffe) and PyTorch give the best results (beating TensorFlow and MXNet). This is an expected result, confirmed by many publications on the internet.
What I did not expect is that using the NVLink bridge would make a VERY significant impact. Here are results using nvcaffe:
single GPU: 450 images/second
dual GPU via a single PCIe switch: 535 images/second
dual GPU via NVLink (P2P enabled): 830 images/second
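A quick back-of-the-envelope check of what these throughputs mean as speedups over a single GPU (no new measurements, just arithmetic on the numbers above):

```python
# Speedup relative to a single GPU, from the measured throughputs above.
single = 450.0   # images/s, 1 GPU
pcie   = 535.0   # images/s, 2 GPUs over one PCIe switch
nvlink = 830.0   # images/s, 2 GPUs with NVLink P2P enabled

print(f"PCIe  : {pcie / single:.2f}x of single GPU")    # ~1.19x
print(f"NVLink: {nvlink / single:.2f}x of single GPU")  # ~1.84x
```

So over PCIe the second GPU buys only about 19% extra throughput, while with NVLink the pair gets to roughly 1.84x, much closer to the ideal 2x.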
The model I train has a massive last layer, on the order of 200K-300K outputs, so I believe this means a lot of data needs to be copied between GPUs, hence a fast link makes such an impact.
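A rough estimate of how much gradient traffic that last layer alone generates per synchronization step supports this. The 250K output count is in the range above; the input (hidden) width of 1024 is a made-up example value, since the real one isn't stated here:

```python
# Rough estimate of gradient bytes produced by the last layer per sync step.
# 250_000 outputs is within the 200K-300K range from the post; the hidden
# width of 1024 is a hypothetical example value.
outputs = 250_000
hidden  = 1024          # assumed input width of the last layer
bytes_per_float = 4     # FP32 gradients

grad_bytes = outputs * hidden * bytes_per_float
print(f"last-layer gradients: {grad_bytes / 1e9:.2f} GB per sync")  # ~1.02 GB
```

Around a gigabyte of gradients exchanged every step is exactly the kind of traffic where NVLink's bandwidth advantage over a shared PCIe switch shows up directly in images/second.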