Using multiple RTX 2080 Ti cards in parallel not possible?

I am trying to run DIGITS (which is Caffe under the hood) on a machine with 8x RTX 2080Ti cards. However, it runs much slower than on a machine with 8x GTX 1080Ti cards.

After a bit of digging I can see that the topology looks good:

# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity
GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     0-19,40-59
GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     0-19,40-59
GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     0-19,40-59
GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     0-19,40-59
GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     20-39,60-79
GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     20-39,60-79
GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     20-39,60-79
GPU7    SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      20-39,60-79

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

However, there is no peer-to-peer access between any of the cards. I used the deviceQuery tool from the CUDA samples, which calls cudaDeviceCanAccessPeer(&can_access_peer, gpuid[i], gpuid[j]).

There is also a thread, https://devtalk.nvidia.com/default/topic/1043300/linux/2080-tis-cudadevicecanaccesspeer-failure-without-nvlink-bridge/, which suggests that P2P access between RTX 2080Ti cards is only possible via an NVLink bridge, but this has not been officially confirmed.

I could try buying an NVLink bridge, but it can only connect 2 cards.

Can anyone point me to the official NVIDIA position regarding P2P access between RTX 2080Ti cards over the PCIe bus? P2P over PCIe works fine for the GTX 1080Ti cards in my other machine.

This thread may be of interest:

https://devtalk.nvidia.com/default/topic/1046951/cuda-programming-and-performance/does-titan-rtx-support-p2p-access-w-o-nvlink-/

Hi Robert,

Thanks for the prompt response. I have the same case as in the thread you suggested.

nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       CNS     CNS     CNS     CNS     CNS     CNS     CNS
 GPU1   CNS     X       CNS     CNS     CNS     CNS     CNS     CNS
 GPU2   CNS     CNS     X       CNS     CNS     CNS     CNS     CNS
 GPU3   CNS     CNS     CNS     X       CNS     CNS     CNS     CNS
 GPU4   CNS     CNS     CNS     CNS     X       CNS     CNS     CNS
 GPU5   CNS     CNS     CNS     CNS     CNS     X       CNS     CNS
 GPU6   CNS     CNS     CNS     CNS     CNS     CNS     X       CNS
 GPU7   CNS     CNS     CNS     CNS     CNS     CNS     CNS     X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
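
For anyone who wants to check a matrix like this from a script rather than by eye, here is a minimal sketch. It assumes the text layout shown above (a header row of GPU names, one row per GPU, then the legend). The names `parse_p2p_matrix`, `p2p_ok_pairs` and `SAMPLE` are made up for illustration and are not part of any NVIDIA tool; `SAMPLE` is a shortened 4-GPU copy of the output above.

```python
# Shortened copy of the `nvidia-smi topo -p2p r` output above (first 4 GPUs only).
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3
 GPU0   X       CNS     CNS     CNS
 GPU1   CNS     X       CNS     CNS
 GPU2   CNS     CNS     X       CNS
 GPU3   CNS     CNS     CNS     X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
"""


def parse_p2p_matrix(text):
    """Return {(row_gpu, col_gpu): status} parsed from the matrix part of the text."""
    status = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if columns is None:
            # The header row is the first line made up only of GPU names.
            if all(f.startswith("GPU") for f in fields):
                columns = [int(f[3:]) for f in fields]
            continue
        # Stop at the first line that is not a matrix row (e.g. "Legend:").
        if not fields[0].startswith("GPU") or len(fields) != len(columns) + 1:
            break
        row = int(fields[0][3:])
        for col, cell in zip(columns, fields[1:]):
            status[(row, col)] = cell
    return status


def p2p_ok_pairs(status):
    """List each unordered GPU pair whose status is OK, once."""
    return sorted((r, c) for (r, c), s in status.items() if r < c and s == "OK")


print(p2p_ok_pairs(parse_p2p_matrix(SAMPLE)))  # [] : no pair reports OK
```

An empty list here matches what deviceQuery reports: no pair of cards can be placed in a P2P relationship.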

It is quite upsetting to realise that we invested several thousand euros in a machine I cannot use.

Is there a list of motherboards or chipsets which support P2P over PCIe?
Or is P2P over PCIe not supported at all for the RTX 2080Ti?
Or is it perhaps possible to change something in the kernel to enable the support?

Not that I know of. Furthermore this particular issue is not a motherboard or chipset issue. Please re-read the thread I linked.

Robert, sorry to be a bit pedantic here. Your post in the other thread states that the RTX 2080Ti can only do P2P over an NVLink bridge. But this implies that you can only run 2 cards in parallel, because the bridge can only connect 2 cards. Is that the case? This seems to be a massive step back from what you could do with the GTX 1080Ti.

For any 2 GPUs in view here (Titan RTX, RTX 2080Ti, RTX 2080) that you wish to place into a P2P relationship, those 2 GPUs must have an NVLink bridge installed between them. You cannot rely on PCIe to establish the peer relationship.

I agree that it is not possible to place more than 2 GPUs in the same P2P clique with this arrangement. (Assuming the products in view here, and assuming no changes to NVLink bridge design.) I believe it should be possible to have up to four 2-way cliques, amongst 8 GPUs, with such an arrangement, assuming you add 4 bridges pairwise amongst the GPUs. That is not the same as having all 8 GPUs participate in the same clique, however. And I have not personally tested that myself.
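
To make the clique arithmetic concrete, here is a small sketch. It assumes, as described above, that P2P exists only between two GPUs joined by an NVLink bridge and that each GPU takes at most one bridge. The function `p2p_cliques` and the particular pairing (0-1, 2-3, 4-5, 6-7) are hypothetical choices for illustration, not something that has been tested on hardware.

```python
def p2p_cliques(num_gpus, bridges):
    """Group GPU ids into P2P cliques, given a list of (gpu_a, gpu_b) bridges.

    Assumption: a bridge joins exactly two GPUs and a GPU has at most one bridge,
    so every clique has size 1 (no bridge) or 2 (bridged pair).
    """
    partner = {}
    for a, b in bridges:
        if not (0 <= a < num_gpus and 0 <= b < num_gpus) or a == b:
            raise ValueError(f"invalid bridge {(a, b)}")
        if a in partner or b in partner:
            raise ValueError(f"GPU in bridge {(a, b)} is already bridged")
        partner[a] = b
        partner[b] = a

    cliques = []
    seen = set()
    for gpu in range(num_gpus):
        if gpu in seen:
            continue
        clique = sorted([gpu, partner[gpu]]) if gpu in partner else [gpu]
        seen.update(clique)
        cliques.append(clique)
    return cliques


# Four bridges installed pairwise amongst 8 GPUs (hypothetical pairing).
print(p2p_cliques(8, [(0, 1), (2, 3), (4, 5), (6, 7)]))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With four bridges the result is four 2-way cliques; no arrangement under these assumptions puts three or more GPUs in the same clique.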

I agree that this is substantially different than GTX 1080Ti behavior.

Please don’t assume that just because I said a pairwise P2P arrangement might be possible that it means that I think it will provide any tangible performance benefit to your DIGITS/Caffe test case.

To be clear, I don’t think it will provide any tangible performance benefits there. You’re welcome to do as you wish of course.

I have investigated further by reading the Caffe source code and running some experiments (I presume TensorFlow behaves the same way). Caffe uses multiple GPUs by splitting each batch between them. That is, if you have a batch of 64 samples which you want to process in parallel on 4 cards, then each card runs the forward pass and most of the backward pass on a batch of 16. Only a tiny amount of computation (with a very small amount of data) is done between cards to “merge” the gradients (a call to ncclAllReduce). This is the only place where the benefit of fast P2P data exchange could show up, so even in theory the benefit of fast P2P looks negligible for Caffe.
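
The batch splitting described above can be illustrated with a toy example. This is a plain-Python sketch under simplifying assumptions, not Caffe or NCCL code: the "model" is a single weight `w` with squared-error loss, and `all_reduce_sum` is a stand-in that only mimics the result of a sum all-reduce. All names are made up for illustration.

```python
def grad_sum(w, samples):
    """Sum over samples of d/dw (w*x - y)^2, i.e. 2*(w*x - y)*x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples)


def split_batch(batch, num_gpus):
    """Split a batch into equal contiguous shards, one per GPU."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    per_gpu = len(batch) // num_gpus
    return [batch[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]


def all_reduce_sum(values):
    """Stand-in for a sum all-reduce: every participant ends up with the total."""
    total = sum(values)
    return [total] * len(values)


batch = [(float(i), 3.0 * i + 1.0) for i in range(64)]  # 64 toy samples
w = 0.5

shards = split_batch(batch, 4)               # 4 "cards", 16 samples each
local = [grad_sum(w, s) for s in shards]     # independent work per card
merged = all_reduce_sum(local)               # the only cross-card step

print(len(shards[0]))                        # 16
print(merged[0] == grad_sum(w, batch))       # True
```

The only cross-card step is the all-reduce, and what it exchanges is one value per model parameter (a single number here), regardless of how many samples each card processed.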

What I also noticed is that I actually get about a 10% performance increase if I split the work between GPUs which sit on different PCIe switches. I presume the explanation is that data exchange between the CPU and the GPUs is faster when the data goes via 2 PCIe switches, so the gain from faster CPU-GPU transfers could outweigh the loss from slower GPU-to-GPU transfers. I will test this as soon as the other machine, with 1080Ti cards and working PCIe P2P, is free.

I have not done any tests with an NVLink bridge yet. If we eventually get one (or several), I will post my findings.
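
For reference, picking GPUs that sit on different PCIe switches could be sketched as below. The switch groups are read off the `nvidia-smi topo -m` matrix earlier in the thread (GPU0-3 behind one switch, GPU4-7 behind the other). The helper `spread_across_switches` is hypothetical, and how the resulting ids are handed to DIGITS/Caffe is left out.

```python
from itertools import chain, zip_longest


def spread_across_switches(switch_groups, count):
    """Pick `count` GPU ids round-robin across the given switch groups."""
    interleaved = [
        gpu
        for gpu in chain.from_iterable(zip_longest(*switch_groups))
        if gpu is not None  # groups may have different sizes
    ]
    if count > len(interleaved):
        raise ValueError("not enough GPUs")
    return interleaved[:count]


# Switch membership taken from the PIX entries in the topology matrix.
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

picked = spread_across_switches(groups, 4)
print(picked)                                # [0, 4, 1, 5]

# Value one could export as CUDA_VISIBLE_DEVICES for a 4-GPU run.
print(",".join(str(g) for g in picked))      # 0,4,1,5
```

One caveat: CUDA's default device numbering is not guaranteed to match the nvidia-smi numbering, so it is worth setting CUDA_DEVICE_ORDER=PCI_BUS_ID when using ids taken from nvidia-smi.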