NV-Link Setup Troubleshooting and NV-Link Status Output Help

bweider · February 28, 2020, 3:58pm

We have been noticing some odd behavior when trying to configure one of our servers (running CentOS 7) for NV-Link using two GV100 GPUs. It appears that two of the links between the GPUs are responding as inactive as shown in the nvidia-smi nv-link status shown below.

Based on the individual link speed (~25 GB/s) it appears we are utilizing NVLink 2.0 but when looking at the bidirectional bandwidth, reported by the p2pBandwidthTest, it appears that we are only getting (~140 GB/s) which mimics NVLink 1.0 speeds when we should be getting ~300 GB/s over NVLink 2.0 .

Please advise what the correct output of nvidia-smi and p2pBandwidthTest should look like for 2 GPUs that have a correctly configured NVLink 2.0 connection?

NV-Link Status reported from nvidia-smi for our two GV100 GPUs:

$nvidia-smi nvlink -s
 
GPU 0: Quadro GV100 (UUID: GPU-6c950f3b-d765-c14a-0f81-5ca6be0a81a7)
Link 0: 25.781 GB/s
Link 1: <inactive>
Link 2: 25.781 GB/s
Link 3: 25.781 GB/s
GPU 1: Quadro GV100 (UUID: GPU-fb5e90b3-f1e1-78fb-8f7e-aef576e48a09)
Link 0: <inactive>
Link 1: 25.781 GB/s
Link 2: 25.781 GB/s
Link 3: 25.781 GB/s

$nvidia-smi nvlink -c
 
GPU 0: Quadro GV100 (UUID: GPU-6c950f3b-d765-c14a-0f81-5ca6be0a81a7)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false
GPU 1: Quadro GV100 (UUID: GPU-fb5e90b3-f1e1-78fb-8f7e-aef576e48a09)
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true   
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: false
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: false

Running the Peer-to-Peer Bandwidth Latency test provided in CUDA Utilities on two GV100 GPU’s:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Quadro GV100, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro GV100, pciBusID: d8, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 548.63  10.43
     1  10.64 552.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 548.63  72.27
     1  72.27 552.51
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 557.64  18.78
     1  18.65 560.04
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 560.84 143.71
     1 140.14 561.65
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.87  18.34
     1  18.23   2.27
 
   CPU     0      1
     0   4.02  11.83
     1  12.05   5.07
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.87   1.91
     1   2.02   2.26

   CPU     0      1
     0   4.06   3.33
     1   3.43   5.04

manuelwoellhaf · September 16, 2021, 11:36am

For all those who desperately end up here. For us it was the NVLink not all the way in the socket. Pressing with a little more force did the trick.

_Bi2022 · April 11, 2023, 9:36am

I have a similar problem with two NVIDIA GeForce RTX 3090 cards and nvlink. However, pressing hard is not working.

rs277 · April 11, 2023, 7:05pm

There is a recent post regarding the lottery that is P2P connectivity here, if you’ve not already seen it.

epk · April 11, 2023, 7:46pm

to be fair, my post was about the lottery of whether P2P would be enabled at all, not about NVLink speed.

But the principle of NVIDIA expecting us to fill but the “lottery card” first then wait for the results is indeed similar.

_Bi2022 · April 12, 2023, 10:14am

Thanks @rs277, @epk. I actually invested a good amount of time and money on these RTX 3090 cards and nvlink bridge in the hope of seamless better computation power. This should be a smooth configuration, not so much hassle. So, the question remains,

Can two rtx3090 be connected over nvlink bridge? The answer should be yes because each of the rtx3090 has nvlink connector on it
The nvlink is definitely not working on my system, because I cannot see it on nvidia control panel, or in nvidia-smi command, or in GPU-Z tool, which should be easily visible
How can I do it? May be @Robert_Crovella can give a better answer.

rs277 · April 12, 2023, 6:50pm

Yes, but whether it works or not is subject to all the vagaries Robert outlined in the thread linked above.

It seems apparent that in order for P2P/NVlink to function correctly, there is quite a complex chain of dependencies, (driver, motherboard hardware, correct BIOS configuration of said hardware, number of PCIe lanes, etc), that need to be satisfied in order for this to function reliably.

It’s somewhat disappointing that Nvidia have not emphasised this more and saved a lot of people wasted time and money. I wonder if this is the reason NVLink is no longer offered on Geforce cards, given Tesla/Quadro customers are probably more likely to be buying turn-key systems where this functionality is specified.

Later: You might find the BIOS setting info in this thread helpful.

_Bi2022 · April 13, 2023, 10:49am

Thanks @rs277, the motherboard issue was not in my head at all. I bought the latest motherboard and did not think whether it would support sli/nvlink at all. I just saw it can fit two RTX, that’s it. Now I feel really stupid.

I have ASUS ROG MAXIMUS Z690 HERO EVA, and google says it does not support sli. But one thing, when I am running nvidia-smi nvlink -s without nvlink

GPU 0: NVIDIA GeForce RTX 3090 
NVML: Unable to retrieve NVLink information as all links are inactive
GPU 1: NVIDIA GeForce RTX 3090 
NVML: Unable to retrieve NVLink information as all links are inactive

And when I am running nvidia-smi nvlink -s with nvlink

GPU 0: NVIDIA GeForce RTX 3090 
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s

It means, the nvidia-smi can still detect the nvlink bridge, however cannot further transmit data,

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

Topic		Replies	Views
RTX 3090 + NVLink + CUDA P2P - not working on Linux or Windows, in different ways? CUDA Programming and Performance	9	7055	May 24, 2023
Two/Dual GeForce RTX 3090s and NVLink: Ubuntu support? At least Blender has support for NVLink Linux cuda , ubuntu	14	10389	April 16, 2023
No P2P - Dual NVlinked RTX 2080 TI setup on HP workstation - NVLink not working, no SLI option in control panel CUDA Setup and Installation	6	2223	December 26, 2019
Ubuntu - NVLink not working with two RTX 3090 GPU - Hardware	5	1838	February 6, 2023
Using an A6000 2-Slot NVLink bracket on Geforce 3090 Linux	8	3184	February 24, 2021
Compatibility of NVLink bridges OptiX	4	3413	June 14, 2022
NVlink not working RTX 3090 / Ubuntu 16.04 Linux	1	2127	March 2, 2022
How to enable P2P access? CUDA Setup and Installation cuda	3	4140	February 6, 2023
Issue with P2P connection using two RTX A4500 CUDA Programming and Performance cuda , ubuntu	7	2368	March 31, 2023
Help with Nvlink and a pair of 2080 ti? CUDA Setup and Installation	0	1222	May 8, 2020

NV-Link Setup Troubleshooting and NV-Link Status Output Help

Related topics