I am trying to run DIGITS (which ultimately runs Caffe) on a machine with 8x RTX 2080Ti cards. However, it is much slower than on a machine with 8x GTX 1080Ti cards.
After a bit of digging, I can see that the topology looks good:
# nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0  X     PIX   PIX   PIX   SYS   SYS   SYS   SYS   0-19,40-59
GPU1  PIX   X     PIX   PIX   SYS   SYS   SYS   SYS   0-19,40-59
GPU2  PIX   PIX   X     PIX   SYS   SYS   SYS   SYS   0-19,40-59
GPU3  PIX   PIX   PIX   X     SYS   SYS   SYS   SYS   0-19,40-59
GPU4  SYS   SYS   SYS   SYS   X     PIX   PIX   PIX   20-39,60-79
GPU5  SYS   SYS   SYS   SYS   PIX   X     PIX   PIX   20-39,60-79
GPU6  SYS   SYS   SYS   SYS   PIX   PIX   X     PIX   20-39,60-79
GPU7  SYS   SYS   SYS   SYS   PIX   PIX   PIX   X     20-39,60-79
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
However, there is no peer-to-peer access between any of the cards. I used the deviceQuery tool from the CUDA samples, which calls cudaDeviceCanAccessPeer(&can_access_peer, gpuid[i], gpuid[j]).
There is also a thread, https://devtalk.nvidia.com/default/topic/1043300/linux/2080-tis-cudadevicecanaccesspeer-failure-without-nvlink-bridge/, which suggests that P2P access for RTX 2080Ti cards is only possible via an NVLink bridge, but this is not officially confirmed.
I could try buying an NVLink bridge, but it can only connect 2 cards.
Can anyone point me to the official NVIDIA position regarding P2P access between RTX 2080Ti cards over the PCIe bus? P2P over PCIe works fine for the GTX 1080Ti cards in my other machine.
Thanks for the prompt response. I have the same case as in the thread you suggested.
nvidia-smi topo -p2p r
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0  X     CNS   CNS   CNS   CNS   CNS   CNS   CNS
GPU1  CNS   X     CNS   CNS   CNS   CNS   CNS   CNS
GPU2  CNS   CNS   X     CNS   CNS   CNS   CNS   CNS
GPU3  CNS   CNS   CNS   X     CNS   CNS   CNS   CNS
GPU4  CNS   CNS   CNS   CNS   X     CNS   CNS   CNS
GPU5  CNS   CNS   CNS   CNS   CNS   X     CNS   CNS
GPU6  CNS   CNS   CNS   CNS   CNS   CNS   X     CNS
GPU7  CNS   CNS   CNS   CNS   CNS   CNS   CNS   X
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
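A status matrix like the one above can also be checked programmatically. The following is a minimal sketch, assuming the matrix has been captured as text in the layout shown in this thread; `parse_p2p_matrix`, `peer_pairs` and the trimmed 4-GPU `SAMPLE` are hypothetical names made up for illustration, not part of any NVIDIA tool.

```python
def parse_p2p_matrix(text):
    """Parse a GPU-by-GPU status matrix into {(row_gpu, col_gpu): status}."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    width = len(rows[0])  # header row: one label per GPU column
    status = {}
    for row in rows[1:]:
        # Keep only matrix rows ("GPU3 CNS CNS ..."); skip legend lines.
        if not row[0].startswith("GPU") or len(row) != width + 1:
            continue
        src = int(row[0][3:])  # "GPU3" -> 3
        for dst, cell in enumerate(row[1:]):
            status[(src, dst)] = cell
    return status


def peer_pairs(status):
    """Return the GPU pairs (a, b), a < b, reported as OK in both directions."""
    return sorted(
        (a, b)
        for (a, b), cell in status.items()
        if a < b and cell == "OK" and status.get((b, a)) == "OK"
    )


# Hypothetical 4-GPU excerpt in the same layout as the output above.
SAMPLE = """
      GPU0  GPU1  GPU2  GPU3
GPU0  X     CNS   CNS   CNS
GPU1  CNS   X     CNS   CNS
GPU2  CNS   CNS   X     CNS
GPU3  CNS   CNS   CNS   X
X = Self
CNS = Chipset not supported
"""

if __name__ == "__main__":
    print(peer_pairs(parse_p2p_matrix(SAMPLE)))
```

For a matrix where every off-diagonal entry is CNS, as in this thread, the list of usable peer pairs is empty.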
It is quite upsetting to realise that we invested several thousand euros in a machine I cannot use.
Is there a list of motherboards or chipsets which support P2P over PCIe?
Or is P2P over PCIe not supported at all for the RTX 2080Ti?
Or is it perhaps possible to change something in the kernel to enable support?
Not that I know of. Furthermore this particular issue is not a motherboard or chipset issue. Please re-read the thread I linked.
Robert, sorry to be a bit pedantic here. Your post in the other thread states that the RTX 2080Ti can only do P2P over an NVLink bridge. But this implies that you can only run 2 cards in parallel, because the bridge can only connect 2 cards. Is that the case? This seems to be a massive step back from what you could do with the GTX 1080Ti.
For any 2 GPUs in view here (Titan RTX, RTX 2080Ti, RTX 2080) that you wish to place into a P2P relationship, those 2 GPUs must have an NVLink bridge installed between them. You cannot rely on PCIe to establish the peer relationship.
I agree that it is not possible to place more than 2 GPUs in the same P2P clique with this arrangement (assuming the products in view here, and assuming no changes to the NVLink bridge design). I believe it should be possible to have up to four 2-way cliques amongst 8 GPUs with such an arrangement, assuming you add 4 bridges pairwise amongst the GPUs. That is not the same as having all 8 GPUs participate in the same clique, however. And I have not personally tested that.
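To illustrate the pairwise arrangement described above, here is a minimal sketch that groups GPUs into P2P cliques. It assumes, as stated in this thread, that P2P exists only across an NVLink bridge and that each GPU is bridged to at most one other GPU; `p2p_cliques` and the example bridge list are hypothetical and do not query any real hardware.

```python
def p2p_cliques(num_gpus, bridges):
    """Group GPU indices into P2P cliques.

    Assumption (from the discussion above): a peer relationship exists only
    between two GPUs joined by a bridge, and a GPU has at most one bridge.
    """
    partner = {}
    for a, b in bridges:
        if a == b or a in partner or b in partner:
            raise ValueError("each GPU can be bridged to at most one other GPU")
        partner[a], partner[b] = b, a

    cliques, seen = [], set()
    for gpu in range(num_gpus):
        if gpu in seen:
            continue
        # A bridged GPU forms a 2-way clique; an unbridged one stands alone.
        group = sorted({gpu, partner.get(gpu, gpu)})
        seen.update(group)
        cliques.append(group)
    return cliques


if __name__ == "__main__":
    # Four bridges installed pairwise amongst 8 GPUs.
    print(p2p_cliques(8, [(0, 1), (2, 3), (4, 5), (6, 7)]))
```

With four bridges the 8 GPUs fall into four separate 2-way cliques; no arrangement under these assumptions yields a single clique of 8.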
I agree that this is substantially different than GTX 1080Ti behavior.
Please don’t assume that just because I said a pairwise P2P arrangement might be possible that it means that I think it will provide any tangible performance benefit to your DIGITS/Caffe test case.
To be clear, I don’t think it will provide any tangible performance benefits there. You’re welcome to do as you wish of course.
I have investigated further by reading the Caffe source code and running some experiments (I presume TensorFlow behaves similarly). Caffe uses multiple GPUs by splitting each batch between them. That is, if you have a batch of 64 samples that you want to process in parallel on 4 cards, then each card runs the forward pass and most of the backward pass on a sub-batch of 16. Only a tiny amount of computation (with a very small amount of data) is done between the cards to “merge” the gradients (the call to ncclAllReduce). This is the only place where fast P2P data exchange could, in theory, make a difference. So even in theory, the benefit of fast P2P looks negligible for Caffe.
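The batch-splitting scheme described above can be sketched in plain Python. This is a toy illustration, not Caffe's actual code: the one-parameter model, the loss and all function names are assumptions made for the example, and `all_reduce_mean` is only a stand-in for the gradient merge step (a real ncclAllReduce call with a sum operation would produce a sum, which the framework would then scale).

```python
def split_batch(batch, num_devices):
    """Split a batch into equal contiguous shards, one per device."""
    if len(batch) % num_devices:
        raise ValueError("batch size must be divisible by the device count")
    shard = len(batch) // num_devices
    return [batch[i * shard:(i + 1) * shard] for i in range(num_devices)]


def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 with respect to w over a shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the merge step: every device ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Toy data: 64 (x, y) samples, split across 4 simulated devices.
batch = [(float(i), 2.0 * i + 1.0) for i in range(64)]
shards = split_batch(batch, 4)                    # 4 shards of 16 samples
grads = [local_gradient(0.5, s) for s in shards]  # computed independently
synced = all_reduce_mean(grads)                   # the only cross-device step

if __name__ == "__main__":
    print(len(shards), len(shards[0]))
    print(synced[0], local_gradient(0.5, batch))
```

Because the shards are equal in size, the averaged per-device gradient matches the gradient over the full batch, and the only value exchanged between devices is the gradient itself.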
I also noticed that I actually get about a 10% performance increase if I split the work between GPUs that sit on different PCIe switches. I presume the explanation is that data exchange between the CPU and the GPUs is faster when the traffic is spread over 2 PCIe switches, so the gain from faster CPU-GPU transfers could outweigh the loss from slower GPU-to-GPU transfers. I will test this as soon as the other machine, with 1080Ti cards and working PCIe P2P, is free.
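Picking GPUs that sit on different PCIe switches can be sketched as follows. This is a minimal illustration assuming a link matrix shaped like the `nvidia-smi topo -m` output earlier in the thread, where PIX marks two GPUs behind the same switch; `switch_groups` and `spread_across_switches` are hypothetical helpers, and the matrix is rebuilt by hand rather than read from the tool.

```python
def switch_groups(links):
    """Group GPU indices whose mutual link type is PIX (same PCIe switch)."""
    groups = []
    for gpu, row in enumerate(links):
        for group in groups:
            if row[group[0]] == "PIX":
                group.append(gpu)
                break
        else:
            groups.append([gpu])  # first GPU seen behind a new switch
    return groups


def spread_across_switches(groups, count):
    """Pick `count` GPUs round-robin so they are spread over the switches."""
    picked, depth = [], 0
    while len(picked) < count:
        added = False
        for group in groups:
            if depth < len(group) and len(picked) < count:
                picked.append(group[depth])
                added = True
        if not added:
            raise ValueError("not enough GPUs available")
        depth += 1
    return picked


# Hand-built copy of the 8-GPU topology: GPUs 0-3 and 4-7 each share a switch.
links = [
    ["X" if i == j else ("PIX" if i // 4 == j // 4 else "SYS") for j in range(8)]
    for i in range(8)
]
groups = switch_groups(links)
selection = spread_across_switches(groups, 4)

if __name__ == "__main__":
    print(groups)
    print(",".join(str(g) for g in selection))
```

The resulting comma-separated list could be used, for example, as the value of the CUDA_VISIBLE_DEVICES environment variable before launching a job; whether that reproduces the 10% gain is something only a measurement on the actual machine can show.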
I have not done any tests with an NVLink bridge yet. If we eventually get one (or several), I will post my findings.