Using multiple RTX 2080 Ti cards in parallel not possible?

I have done further investigation of the Caffe framework by looking at the source code and running some experiments (I presume TensorFlow behaves the same way). Caffe uses multiple GPU cards by splitting the batch between GPUs. That is, if you have a batch of 64 samples that you want to process in parallel on 4 cards, each card runs the forward pass and most of the backward pass on its own batch of 16. Only a tiny amount of computation (with a very small amount of data) happens between cards to "merge" the gradients (a call to ncclAllReduce). That gradient merge is the only place where the benefit of fast P2P data exchange could theoretically show up. So even in theory, the benefit of fast P2P looks negligible for the Caffe framework.
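To make that pattern concrete, here is a minimal single-process sketch of the gradient merge step using the public NCCL API (the GPU count, device list, and gradient buffer size are assumptions for illustration, not values taken from Caffe). Each GPU holds the gradients computed from its 16-sample slice, and one ncclAllReduce call per GPU sums them in place; this is the only inter-GPU traffic in the data-parallel scheme described above.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const int nGpus = 4;                        // assumed: 4 cards, batch of 64 split as 16 per card
  const size_t gradCount = 1 << 20;           // assumed gradient size (floats), illustration only
  std::vector<int> devs = {0, 1, 2, 3};
  std::vector<ncclComm_t> comms(nGpus);
  std::vector<float*> grads(nGpus);
  std::vector<cudaStream_t> streams(nGpus);

  // Per-GPU gradient buffer: in a real run each GPU would have filled this
  // during its backward pass over its slice of the batch.
  for (int i = 0; i < nGpus; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc(&grads[i], gradCount * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per GPU, all inside a single process.
  ncclCommInitAll(comms.data(), nGpus, devs.data());

  // Sum gradients in place across all GPUs; this small exchange is the only
  // point where P2P (or NVLink) bandwidth between the cards matters.
  ncclGroupStart();
  for (int i = 0; i < nGpus; ++i) {
    ncclAllReduce(grads[i], grads[i], gradCount, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < nGpus; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  // Cleanup.
  for (int i = 0; i < nGpus; ++i) {
    ncclCommDestroy(comms[i]);
    cudaSetDevice(devs[i]);
    cudaFree(grads[i]);
    cudaStreamDestroy(streams[i]);
  }
  printf("gradient all-reduce done\n");
  return 0;
}
```

Note how small this exchange is relative to the forward/backward work: the all-reduce moves one copy of the gradients per step, while each card spends the bulk of its time computing over its own 16 samples, which is why faster P2P barely moves the overall numbers.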

I also noticed that I actually get about a 10% performance increase if I split the work between GPUs that sit on different PCIe switches. A possible explanation is that data exchange between the CPU and the GPUs is faster when the traffic goes through two PCIe switches instead of one. So the gain from faster CPU-GPU transfers could outweigh the loss from slower GPU-to-GPU transfers. I will test this as soon as the other machine, which has 1080 Ti cards with PCIe P2P working, is free (see the bandwidth sketch below for how I plan to measure it).
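One way to check that explanation is to measure host-to-device throughput per GPU while the split configurations are running; `nvidia-smi topo -m` shows which GPUs hang off the same PCIe switch. Below is a rough probe I would use, not the actual test from the run above; the transfer size is an arbitrary assumption.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <chrono>

int main() {
  const size_t bytes = 256ull << 20;   // 256 MB per copy, arbitrary test size
  int nGpus = 0;
  cudaGetDeviceCount(&nGpus);

  // Pinned host buffer so the copy reflects PCIe throughput,
  // not pageable-memory staging overhead.
  void* hostBuf = nullptr;
  cudaMallocHost(&hostBuf, bytes);

  for (int dev = 0; dev < nGpus; ++dev) {
    cudaSetDevice(dev);
    void* devBuf = nullptr;
    cudaMalloc(&devBuf, bytes);

    // Warm-up copy, then a timed copy.
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("GPU %d: host-to-device %.2f GB/s\n", dev, bytes / seconds / 1e9);

    cudaFree(devBuf);
  }
  cudaFreeHost(hostBuf);
  return 0;
}
```

If GPUs on different switches show materially higher host-to-device numbers while training is saturating the links, that would support the explanation; if the numbers are flat, the 10% gain must come from somewhere else.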

I have not done any tests with an NVLink bridge yet. If we finally get one (or two), I will post my findings.