We have been noticing some odd behavior while configuring one of our servers (running CentOS 7) for NVLink with two GV100 GPUs. Two of the links between the GPUs are reported as inactive, as shown in the nvidia-smi nvlink status output below.
Based on the individual link speed (~25 GB/s per direction) it appears we are running NVLink 2.0, but the bidirectional bandwidth reported by the p2pBandwidthLatencyTest sample is only ~140 GB/s, which mimics NVLink 1.0 speeds, when we would expect roughly 200 GB/s across all four NVLink 2.0 links of a GV100 pair.
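For reference, our expectation comes from the per-link rate nvidia-smi reports (25.781 GB/s per direction), assuming aggregate bandwidth simply scales with the number of active links:

4 links x 25.781 GB/s x 2 directions ≈ 206 GB/s bidirectional (all four links active)
3 links x 25.781 GB/s x 2 directions ≈ 155 GB/s bidirectional (one link down per GPU)

The ~140 GB/s we measure is much closer to the second figure, which is why we suspect the two inactive links rather than a genuine fallback to NVLink 1.0.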
Please advise what the correct output of nvidia-smi and p2pBandwidthLatencyTest should look like for two GPUs with a correctly configured NVLink 2.0 connection.
NVLink status reported by nvidia-smi for our two GV100 GPUs:
$ nvidia-smi nvlink -s
GPU 0: Quadro GV100 (UUID: GPU-6c950f3b-d765-c14a-0f81-5ca6be0a81a7)
    Link 0: 25.781 GB/s
    Link 1: <inactive>
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
GPU 1: Quadro GV100 (UUID: GPU-fb5e90b3-f1e1-78fb-8f7e-aef576e48a09)
    Link 0: <inactive>
    Link 1: 25.781 GB/s
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
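For completeness, the same link states can also be read programmatically through NVML. Below is a minimal sketch we put together (the file name is our own; it assumes the nvml.h header and libnvidia-ml library shipped with the driver; build with: gcc nvlink_state.c -o nvlink_state -lnvidia-ml):

/* nvlink_state.c: minimal NVML sketch that mirrors "nvidia-smi nvlink -s".
   Our own illustrative example, not part of the driver or CUDA samples. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int ngpus = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&ngpus);
    for (unsigned int i = 0; i < ngpus; ++i) {
        nvmlDevice_t dev;
        char name[96] = "";
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof name);
        printf("GPU %u: %s\n", i, name);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            /* Returns an error for links this GPU does not expose; skip those. */
            if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
                continue;
            printf("    Link %u: %s\n", link,
                   active == NVML_FEATURE_ENABLED ? "active" : "<inactive>");
        }
    }
    nvmlShutdown();
    return 0;
}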
$ nvidia-smi nvlink -c
GPU 0: Quadro GV100 (UUID: GPU-6c950f3b-d765-c14a-0f81-5ca6be0a81a7)
    Link 0, P2P is supported: true
    Link 0, Access to system memory supported: true
    Link 0, P2P atomics supported: true
    Link 0, System memory atomics supported: true
    Link 0, SLI is supported: true
    Link 0, Link is supported: false
    Link 2, P2P is supported: true
    Link 2, Access to system memory supported: true
    Link 2, P2P atomics supported: true
    Link 2, System memory atomics supported: true
    Link 2, SLI is supported: true
    Link 2, Link is supported: false
    Link 3, P2P is supported: true
    Link 3, Access to system memory supported: true
    Link 3, P2P atomics supported: true
    Link 3, System memory atomics supported: true
    Link 3, SLI is supported: true
    Link 3, Link is supported: false
GPU 1: Quadro GV100 (UUID: GPU-fb5e90b3-f1e1-78fb-8f7e-aef576e48a09)
    Link 1, P2P is supported: true
    Link 1, Access to system memory supported: true
    Link 1, P2P atomics supported: true
    Link 1, System memory atomics supported: true
    Link 1, SLI is supported: true
    Link 1, Link is supported: false
    Link 2, P2P is supported: true
    Link 2, Access to system memory supported: true
    Link 2, P2P atomics supported: true
    Link 2, System memory atomics supported: true
    Link 2, SLI is supported: true
    Link 2, Link is supported: false
    Link 3, P2P is supported: true
    Link 3, Access to system memory supported: true
    Link 3, P2P atomics supported: true
    Link 3, System memory atomics supported: true
    Link 3, SLI is supported: true
    Link 3, Link is supported: false
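The capability rows can be read back the same way via nvmlDeviceGetNvLinkCapability. The helper below slots into the per-link loop of the previous sketch; we are assuming nvidia-smi's "Link is supported" row corresponds to NVML_NVLINK_CAP_VALID:

/* Prints the capability rows for one link, mirroring "nvidia-smi nvlink -c".
   Drop into the per-link loop of nvlink_state.c above (same headers).
   Assumption: "Link is supported" maps to NVML_NVLINK_CAP_VALID. */
static void print_link_caps(nvmlDevice_t dev, unsigned int link)
{
    static const struct { nvmlNvLinkCapability_t cap; const char *label; } rows[] = {
        { NVML_NVLINK_CAP_P2P_SUPPORTED,  "P2P is supported" },
        { NVML_NVLINK_CAP_SYSMEM_ACCESS,  "Access to system memory supported" },
        { NVML_NVLINK_CAP_P2P_ATOMICS,    "P2P atomics supported" },
        { NVML_NVLINK_CAP_SYSMEM_ATOMICS, "System memory atomics supported" },
        { NVML_NVLINK_CAP_SLI_BRIDGE,     "SLI is supported" },
        { NVML_NVLINK_CAP_VALID,          "Link is supported" },
    };
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; ++i) {
        unsigned int supported = 0;
        if (nvmlDeviceGetNvLinkCapability(dev, link, rows[i].cap, &supported) == NVML_SUCCESS)
            printf("    Link %u, %s: %s\n", link, rows[i].label,
                   supported ? "true" : "false");
    }
}

If NVML_NVLINK_CAP_VALID really does come back false on links that are carrying traffic at 25.781 GB/s, that by itself seems inconsistent with the status output above.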
Running the p2pBandwidthLatencyTest sample (from 1_Utilities in the CUDA Samples) on the two GV100 GPUs:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Quadro GV100, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro GV100, pciBusID: d8, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
     D\D     0     1
       0     1     1
       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  548.63   10.43
       1   10.64  552.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
     D\D       0       1
       0  548.63   72.27
       1   72.27  552.51
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  557.64   18.78
       1   18.65  560.04
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
     D\D       0       1
       0  560.84  143.71
       1  140.14  561.65
P2P=Disabled Latency Matrix (us)
     GPU       0       1
       0    1.87   18.34
       1   18.23    2.27
     CPU       0       1
       0    4.02   11.83
       1   12.05    5.07
P2P=Enabled Latency (P2P Writes) Matrix (us)
     GPU       0       1
       0    1.87    1.91
       1    2.02    2.26
     CPU       0       1
       0    4.06    3.33
       1    3.43    5.04
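In case the measurement methodology itself is in question, below is a stripped-down reconstruction of the unidirectional copy loop as we understand p2pBandwidthLatencyTest to work (our own simplification, not the actual sample code; buffer size and repetition count are arbitrary and error checking is trimmed; build with: nvcc p2p_copy.cu -o p2p_copy):

// p2p_copy.cu: simplified unidirectional P2P copy timing, GPU 0 -> GPU 1.
// Our own reconstruction for cross-checking, not the CUDA sample itself.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Confirm both directions of peer access before enabling it.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("P2P not available\n"); return 1; }

    const size_t bytes = 64ull << 20;  // 64 MiB per copy (arbitrary)
    const int reps = 100;              // number of copies to average over
    void *src = NULL, *dst = NULL;

    cudaSetDevice(1);                  // destination buffer on GPU 1
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);                  // source buffer on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&src, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);  // GPU 0 -> GPU 1
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes per millisecond converted to GB/s: bytes / (ms * 1e6)
    printf("GPU0 -> GPU1: %.2f GB/s\n", (double)bytes * reps / (ms * 1e6));
    return 0;
}

With three of the four links active at 25.781 GB/s per direction, we would expect this to print on the order of 75 GB/s, which lines up with the 72.27 GB/s in the Unidirectional P2P=Enabled matrix above.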