RTX 3090 + NVLink + CUDA P2P - not working on Linux or Windows, in different ways?

Hello,

I am using the 4-slot RTX NVLINK bridge along with two RTX 3090 cards. In both Windows and Linux, it seems that it’s not quite working (with CUDA 11.8).

On Ubuntu 20.04, driver 520.61.05, nvidia-smi nvlink seems to indicate that the NVLink connections are present but down. The p2pBandwidthLatencyTest example indicates that peer-to-peer access is working … but the actual P2P bandwidth is so slow (<0.01 GB/s) that the example hangs.

$ nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	NV4	0-11		N/A
GPU1	NV4	 X 	0-11		N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

$ nvidia-smi nvlink -s
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3d99eb33-dec9-0db3-e357-c6df76bd8363)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
NVML: Unable to retrieve NVLink information as all links are inActive

$ nvidia-smi nvlink -c
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3d99eb33-dec9-0db3-e357-c6df76bd8363)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)

$ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 809.59   1.28 
     1   1.42 831.56 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
^C
(... example hangs, P2P bandwidth << 0.1 GB/s)

On Windows 10 Pro 64-bit, driver 526.47, nvidia-smi nvlink suggests the link is running, except for “Link is supported: false”, and CUDA fails to detect P2P access.

C:\Windows\system32>nvidia-smi.exe nvlink -s
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1d4eb0a5-cd7c-a08e-3614-1d784dfb3cf3)
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s

C:\Windows\system32>nvidia-smi.exe nvlink -c
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1d4eb0a5-cd7c-a08e-3614-1d784dfb3cf3)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
         Link 1, P2P is supported: true
         Link 1, Access to system memory supported: true
         Link 1, P2P atomics supported: true
         Link 1, System memory atomics supported: true
         Link 1, SLI is supported: true
         Link 1, Link is supported: false
         Link 2, P2P is supported: true
         Link 2, Access to system memory supported: true
         Link 2, P2P atomics supported: true
         Link 2, System memory atomics supported: true
         Link 2, SLI is supported: true
         Link 2, Link is supported: false
         Link 3, P2P is supported: true
         Link 3, Access to system memory supported: true
         Link 3, P2P atomics supported: true
         Link 3, System memory atomics supported: true
         Link 3, SLI is supported: true
         Link 3, Link is supported: false
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
         Link 1, P2P is supported: true
         Link 1, Access to system memory supported: true
         Link 1, P2P atomics supported: true
         Link 1, System memory atomics supported: true
         Link 1, SLI is supported: true
         Link 1, Link is supported: false
         Link 2, P2P is supported: true
         Link 2, Access to system memory supported: true
         Link 2, P2P atomics supported: true
         Link 2, System memory atomics supported: true
         Link 2, SLI is supported: true
         Link 2, Link is supported: false
         Link 3, P2P is supported: true
         Link 3, Access to system memory supported: true
         Link 3, P2P atomics supported: true
         Link 3, System memory atomics supported: true
         Link 3, SLI is supported: true
         Link 3, Link is supported: false

C:\Users\Dev\src\nvidia-cuda-samples\bin\win64\Release>p2pBandwidthLatencyTest.exe
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
... (output truncated)

What is the underlying problem here?

Of note: one card is connected via PCIe 3.0 x16, and the other via PCIe 2.0 x4. No other issues with this configuration - bandwidthTest reports the expected PCIe up/down throughputs.
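To compare the two `nvidia-smi nvlink -s` outputs above mechanically, here is a minimal Python sketch that parses the pasted text and reports how many links each GPU shows as active. The function name `parse_nvlink_status` and the regular expressions are illustrative assumptions based only on the output format shown in this post; other driver versions may print something different.

```python
import re

def parse_nvlink_status(text):
    """Map GPU index -> list of per-link speeds in GB/s.

    An empty list means the output listed no active links for that GPU.
    """
    links = {}
    current = None
    for line in text.splitlines():
        gpu = re.match(r"\s*GPU (\d+):", line)
        if gpu:
            current = int(gpu.group(1))
            links[current] = []
            continue
        link = re.match(r"\s*Link (\d+): ([\d.]+) GB/s", line)
        if link and current is not None:
            links[current].append(float(link.group(2)))
    return links

# Output copied from the Linux run in this post.
LINUX_OUTPUT = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3d99eb33-dec9-0db3-e357-c6df76bd8363)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
NVML: Unable to retrieve NVLink information as all links are inActive
"""

# Output copied from the Windows run in this post.
WINDOWS_OUTPUT = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1d4eb0a5-cd7c-a08e-3614-1d784dfb3cf3)
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s
"""

for name, text in (("Linux", LINUX_OUTPUT), ("Windows", WINDOWS_OUTPUT)):
    for gpu, speeds in parse_nvlink_status(text).items():
        if speeds:
            state = f"{len(speeds)} active links, {sum(speeds):.3f} GB/s total"
        else:
            state = "no active links"
        print(f"{name} GPU {gpu}: {state}")
```

This only summarises what the driver reports; it does not explain why the Linux side shows the links as inactive while `nvidia-smi topo -m` still reports NV4.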

I have no experience with NVLink, but this Reddit thread has some comments regarding mixed PCIe width and also BIOS support for SLI.

Thanks. I’ve seen several similar threads over the last couple of years, pointing inconclusively to:

Will post more if I learn more.

Another data point: This support query on a Supermicro X10DAI motherboard shows that despite having a pair of x16 slots, it is not SLI capable.

That a serious, non-gaming-oriented motherboard is not SLI capable perhaps lends some credence to your reference above to Nvidia certification. Maybe manufacturers are required to submit boards to Nvidia for testing and approval.

Right. This is a crucial “gotcha” or point of confusion; my use case has nothing to do with SLI or gaming. I only care about NVLink, and the cards come with that.

For another motherboard example, the ASUS Pro WS WRX80E-SAGE SE has seven PCIe 4.0 x16 slots and “4-way SLI” support, but again there is no way to be certain what will work with how many GPUs.

This V-Ray forum thread has two interesting comments:

  • “Running 7 GPUs are not gonna work out of box” …
  • “You will not be able to use NVlink with this setup, could only use it with one pair of GPUs (driver limitation that Nvidia patched a while back)”

But from the motherboard specs you would reasonably expect at least two pairs of NVLinked RTX 3090s (4 total). Is that the same thing as 4-way SLI, or is there truly a one-NVLink-bridge-per-system limit enforced by the driver with no indication to the user at all?

Yes, it does seem a bit of a mess. In the few instances I’ve come across of people successfully using a pair of 3090s with NVLink under Windows, settings labeled “SLI” had to be enabled for it to work.

For more mental gymnastics, check out the Supermicro X10DAX, which is SLI capable: “Supports 3-way Geforce SLI (4-way SLI support for dual GPU graphics cards)”, presumably 2 cards in the x16 slots linked and 2 cards in the x8 slots linked.

For something that requires a serious investment in both cards and motherboard, you would think a bit more clarity is in order.

That said, with the RTX 3090 being the only card here with NVLink support and the new RTX 4090 having none, Nvidia probably doesn’t really care now. NVLink is for data centre cards only, and those are only supported by certified vendors.

Yep, it’s almost as if the cost of discovering this the fun way is priced into those other offerings!

You’re running one card on PCIe 2.0 x4 (theoretical limit of 2 GB/s per direction; I’d expect about 1.8 GB/s in practice).

Your test returns unidirectional bandwidth between devices of 1.28 and 1.42 GB/s.

I think the rate that nvlink -s reports is not what you can actually use with your configuration: the data fed to the GPU on the 2.0 x4 slot still has to register and pass its addresses as it is transferred, and that can only happen at about 1.8 GB/s on PCIe 2.0 x4. I would love to learn more, though.

I have the same setup, and I struggled to even get all 4 links to show up. I fixed that by seating the 3090 firmly in its slot; it had very little clearance above the chipset cooler because of the water-cooled front and back plates I have on it (soon both cards will have the same cooling).
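As a rough check on the PCIe figures discussed here, the theoretical per-direction bandwidth of a link follows from the per-lane transfer rate, the line encoding, and the lane count. The sketch below uses the published rates for PCIe generations 1 to 4; the table and function names are my own, and real-world throughput will be lower because of protocol overhead.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency per PCIe generation.
# Gen 1 and 2 use 8b/10b encoding, gen 3 and 4 use 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}

def pcie_bandwidth_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes / 8  # divide by 8: bits to bytes

# The three link configurations mentioned in this thread.
for generation, lanes in ((2, 4), (3, 16), (4, 8)):
    bandwidth = pcie_bandwidth_gb_s(generation, lanes)
    print(f"PCIe {generation}.0 x{lanes}: {bandwidth:.2f} GB/s per direction")
```

PCIe 2.0 x4 comes out at 2.0 GB/s per direction, so the 1.28 and 1.42 GB/s measured with P2P disabled are at least consistent with the x4 slot limiting copies that are staged over PCIe. That alone would not explain the near-zero bandwidth seen once P2P is enabled.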

NVLink appears to be enabled and functioning on this system.

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 2d, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 2e, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 657.62   5.73 
     1   5.55 826.72 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 802.52  52.61 
     1  52.75 826.72 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 812.74   6.38 
     1   6.36 757.39 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 744.76  99.46 
     1  99.38 757.21 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0  11.03  95.29 
     1  91.62   7.49 

   CPU     0      1 
     0  10.00  35.16 
     1  38.35   7.01 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   9.09   6.73 
     1   6.54   8.52 

   CPU     0      1 
     0   7.79   5.31 
     1   5.87   5.89 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

C:\Windows\System32>nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 526.86       Driver Version: 526.86       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:2D:00.0  On |                  N/A |
| 53%   48C    P8    36W / 350W |   2388MiB / 24576MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ... WDDM  | 00000000:2E:00.0 Off |                  N/A |
|  0%   48C    P8    24W / 350W |   2388MiB / 24576MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
C:\Windows\System32>nvidia-smi nvlink -c
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-myserialnumber-#######)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
         Link 1, P2P is supported: true
         Link 1, Access to system memory supported: true
         Link 1, P2P atomics supported: true
         Link 1, System memory atomics supported: true
         Link 1, SLI is supported: true
         Link 1, Link is supported: false
         Link 2, P2P is supported: true
         Link 2, Access to system memory supported: true
         Link 2, P2P atomics supported: true
         Link 2, System memory atomics supported: true
         Link 2, SLI is supported: true
         Link 2, Link is supported: false
         Link 3, P2P is supported: true
         Link 3, Access to system memory supported: true
         Link 3, P2P atomics supported: true
         Link 3, System memory atomics supported: true
         Link 3, SLI is supported: true
         Link 3, Link is supported: false
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-myserialnumber-#######)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
         Link 1, P2P is supported: true
         Link 1, Access to system memory supported: true
         Link 1, P2P atomics supported: true
         Link 1, System memory atomics supported: true
         Link 1, SLI is supported: true
         Link 1, Link is supported: false
         Link 2, P2P is supported: true
         Link 2, Access to system memory supported: true
         Link 2, P2P atomics supported: true
         Link 2, System memory atomics supported: true
         Link 2, SLI is supported: true
         Link 2, Link is supported: false
         Link 3, P2P is supported: true
         Link 3, Access to system memory supported: true
         Link 3, P2P atomics supported: true
         Link 3, System memory atomics supported: true
         Link 3, SLI is supported: true
         Link 3, Link is supported: false

C:\Windows\System32>nvidia-smi -q    (really useful; look for this section)

        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 8x

That’s on Windows, with an HDMI cable going out to a display. I’ll set up SSH and Linux and see what’s different.
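To put the P2P-enabled numbers above in context, a small Python sketch can compare the measured unidirectional bandwidth with the aggregate NVLink rate. It assumes this system reports the same 4 links at 14.062 GB/s each that the first poster's `nvidia-smi nvlink -s` showed on Windows; that output was not posted for this machine, so treat the per-link rate as an assumption. The function name is hypothetical.

```python
LINKS = 4
PER_LINK_GB_S = 14.062  # assumed: taken from the first poster's nvlink -s output

# Unidirectional GPU0 -> GPU1 figures from the p2pBandwidthLatencyTest run above.
MEASURED_P2P_ENABLED_GB_S = 52.61
MEASURED_P2P_DISABLED_GB_S = 5.73

def summarize(links, per_link, p2p_on, p2p_off):
    """Return (aggregate link rate, fraction of it achieved, speedup over staged copies)."""
    aggregate = links * per_link
    return aggregate, p2p_on / aggregate, p2p_on / p2p_off

aggregate, fraction, speedup = summarize(
    LINKS, PER_LINK_GB_S, MEASURED_P2P_ENABLED_GB_S, MEASURED_P2P_DISABLED_GB_S
)
print(f"Aggregate NVLink rate: {aggregate:.3f} GB/s")
print(f"Measured P2P bandwidth is {fraction:.1%} of that")
print(f"Speedup over P2P-disabled copies: {speedup:.1f}x")
```

Under that assumption the measured 52.61 GB/s is a little over 90% of the aggregate link rate, which would suggest the transfer really is going over NVLink rather than PCIe.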