Hello,
I am using the 4-slot RTX NVLink bridge with two RTX 3090 cards, and NVLink does not appear to be working under either Windows or Linux (CUDA 11.8 in both cases).
On Ubuntu 20.04 with driver 520.61.05, nvidia-smi indicates that the NVLink connections are present (NV4 in the topology matrix) but down (all links inactive). The p2pBandwidthLatencyTest CUDA sample reports that peer-to-peer access works, but the actual P2P bandwidth is so low (< 0.01 GB/s) that the test effectively hangs.
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     0-11            N/A
GPU1    NV4      X      0-11            N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
$ nvidia-smi nvlink -s
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3d99eb33-dec9-0db3-e357-c6df76bd8363)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
NVML: Unable to retrieve NVLink information as all links are inActive
$ nvidia-smi nvlink -c
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3d99eb33-dec9-0db3-e357-c6df76bd8363)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0       1
     0   809.59    1.28
     1     1.42  831.56
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
^C
(... the test hangs at this point; the reported P2P bandwidth is well below 0.1 GB/s)
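For context, the step where the sample hangs boils down to a single peer-to-peer copy. Here is a minimal sketch of that check (not the sample's actual code; the file name, buffer size, and structure are my own arbitrary choices): it queries cudaDeviceCanAccessPeer, enables peer access in both directions, and times one cudaMemcpyPeer.

// p2p_check.cu - minimal sketch of a peer-copy check (not the sample's code)
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
  fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e)); return 1; } } while (0)

int main() {
  // Same query the sample performs before enabling P2P
  int can01 = 0, can10 = 0;
  CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
  CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
  printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);
  if (!can01 || !can10) return 1;

  const size_t bytes = 64ull << 20;  // 64 MiB, an arbitrary transfer size
  void *buf0 = nullptr, *buf1 = nullptr;
  CHECK(cudaSetDevice(0));
  CHECK(cudaDeviceEnablePeerAccess(1, 0));
  CHECK(cudaMalloc(&buf0, bytes));
  CHECK(cudaSetDevice(1));
  CHECK(cudaDeviceEnablePeerAccess(0, 0));
  CHECK(cudaMalloc(&buf1, bytes));

  CHECK(cudaSetDevice(0));
  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));
  CHECK(cudaEventRecord(start));
  // The peer copy itself - the analogous step to where the sample hangs
  CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
  CHECK(cudaEventRecord(stop));
  CHECK(cudaEventSynchronize(stop));
  float ms = 0.0f;
  CHECK(cudaEventElapsedTime(&ms, start, stop));
  printf("P2P copy: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
  return 0;
}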
On Windows 10 Pro 64-bit with driver 526.47, nvidia-smi nvlink suggests the links are up and running at full speed, except that every link reports "Link is supported: false", and CUDA fails to detect P2P access entirely.
C:\Windows\system32>nvidia-smi.exe nvlink -s
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1d4eb0a5-cd7c-a08e-3614-1d784dfb3cf3)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
C:\Windows\system32>nvidia-smi.exe nvlink -c
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1d4eb0a5-cd7c-a08e-3614-1d784dfb3cf3)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-612b2086-7c2c-adfa-9b66-cef79e941f0d)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false
C:\Users\Dev\src\nvidia-cuda-samples\bin\win64\Release>p2pBandwidthLatencyTest.exe
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 25, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
... (output truncated)
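For what it's worth, those per-link flags come from NVML, so they can be dumped directly. A minimal sketch (assuming "Link is supported" maps to the NVML_NVLINK_CAP_VALID capability, which I have not verified; link against the driver's NVML library):

// nvlink_caps.cpp - dump NVLink state and the VALID capability via NVML (a sketch)
// Build: nvcc nvlink_caps.cpp -o nvlink_caps -lnvidia-ml   (Windows: link nvml.lib)
#include <cstdio>
#include <nvml.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;
  unsigned int count = 0;
  nvmlDeviceGetCount(&count);
  for (unsigned int i = 0; i < count; ++i) {
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;
    printf("GPU %u:\n", i);
    for (unsigned int link = 0; link < 4; ++link) {  // RTX 3090 exposes 4 links
      nvmlEnableState_t active;
      if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS) continue;
      // Presumably what nvidia-smi prints as "Link is supported"
      unsigned int valid = 0;
      nvmlDeviceGetNvLinkCapability(dev, link, NVML_NVLINK_CAP_VALID, &valid);
      printf("  Link %u: active=%u, valid=%u\n", link, (unsigned)active, valid);
    }
  }
  nvmlShutdown();
  return 0;
}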
What is the underlying problem here?
Of note: one card is connected via PCIe 3.0 x16 and the other via PCIe 2.0 x4. This configuration has caused no other issues: the bandwidthTest sample reports the expected host-to-device and device-to-host throughput for each card.
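(If the lane configuration matters here, the currently negotiated PCIe link for each card can be double-checked with:

$ nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv

in case anyone wants to confirm the x16/x4 split on a similar setup.)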