NVLink error 74 fatal error detected

Hi,
I have a system with four P100 GPUs connected via NVLink. I don't know when or how it started, but the driver reports NVLink error code 74 even after a fresh reboot with no workload running. As a result, GPU0 and GPU1, which are supposed to be interconnected via NVLink, have degraded to PCIe.

Can anybody help? What is the root cause? Does it mean a hardware fault or just a software/driver issue? Is there any way to fix it? Many thanks.

Env:
GPU: 4x P100-SXM2 16GB (see topology below); the issue occurs on NVLink 3, which connects GPU0 and GPU1
OS: Ubuntu Linux 16.04, kernel 4.4.0-98-generic
Driver: 384.90 (most recent), CUDA 8.0
VBIOS: P100_PCN204260.bin (most recent)

After a reboot, with nothing running, dmesg reports the NVLink error. Error code 74 means an NVLink hardware/driver/bus error.

[ 6.270401] NVRM: GPU at PCI:0000:04:00: GPU-c0654425-de20-8455-c301-e8503e61cfe3
[ 6.270417] NVRM: GPU Board Serial Number: 0321217216336
[ 6.270420] NVRM: Xid (PCI:0000:04:00): 74, NVLink: fatal error detected on link 3(0x0, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) <<<====
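For anyone who wants to scan a full dmesg dump for lines like the one above, here is a minimal Python sketch. The regular expressions are only inferred from the single log line in this post, and the names `XidEvent` and `parse_xid` are made up for illustration; they are not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one NVRM Xid line quoted in this thread (an assumption,
# other driver versions may format the line differently).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
# The NVLink message names the failing link as "on link N".
LINK_RE = re.compile(r"on link (\d+)")


class XidEvent(NamedTuple):
    """Hypothetical container for one parsed Xid report."""
    pci: str
    xid: int
    link: Optional[int]
    message: str


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the parsed Xid event, or None if the line is not an Xid report."""
    m = XID_RE.search(line)
    if m is None:
        return None
    link = LINK_RE.search(m.group("msg"))
    return XidEvent(
        pci=m.group("pci"),
        xid=int(m.group("xid")),
        link=int(link.group(1)) if link else None,
        message=m.group("msg").strip(),
    )


sample = (
    "[ 6.270420] NVRM: Xid (PCI:0000:04:00): 74, NVLink: fatal error detected "
    "on link 3(0x0, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"
)
event = parse_xid(sample)
print(event.pci, event.xid, event.link)  # → 0000:04:00 74 3
```

Feeding every line of `dmesg` output through `parse_xid` and keeping the non-None results would show whether the error is always on the same GPU and link.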

frank@T4130:~$ nvidia-smi
Thu Nov 23 17:00:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   33C    P0    40W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      0MiB / 16276MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

frank@T4130:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      PIX     NV1     NV2     PIX     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1    PIX      X      NV2     NV1     PIX     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2    NV1     NV2      X      NV1     PIX     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU3    NV2     NV1     NV1      X      PIX     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
mlx5_0  PIX     PIX     PIX     PIX      X

The connection between GPU0 and GPU1 is supposed to be NVLink, but it is now reported as PIX (PCIe switch).
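To check a topology matrix like the one above automatically, a rough Python sketch follows. It assumes the whitespace-separated layout shown in this post, with the device columns ending where the "CPU Affinity" header starts. The function names `parse_topo` and `gpu_pairs_without_nvlink` are hypothetical, and the affinity strings in the sample are shortened for readability.

```python
from itertools import combinations


def parse_topo(text):
    """Return {(row_device, col_device): link_type} from `nvidia-smi topo -m` text."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = rows[0]
    # Device names are the header tokens before "CPU" (from "CPU Affinity").
    devices = header[: header.index("CPU")] if "CPU" in header else header
    matrix = {}
    for row in rows[1:]:
        name, cells = row[0], row[1 : 1 + len(devices)]
        for col, cell in zip(devices, cells):
            matrix[(name, col)] = cell
    return matrix


def gpu_pairs_without_nvlink(matrix):
    """List GPU pairs whose link type does not start with 'NV' (NV1, NV2, ...)."""
    gpus = sorted({row for row, _ in matrix if row.startswith("GPU")})
    return [
        (a, b)
        for a, b in combinations(gpus, 2)
        if not matrix[(a, b)].startswith("NV")
    ]


# Matrix from this thread; CPU affinity column shortened for the example.
sample = """\
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
GPU0 X PIX NV1 NV2 PIX 0-0,2-2
GPU1 PIX X NV2 NV1 PIX 0-0,2-2
GPU2 NV1 NV2 X NV1 PIX 0-0,2-2
GPU3 NV2 NV1 NV1 X PIX 0-0,2-2
mlx5_0 PIX PIX PIX PIX X
"""
print(gpu_pairs_without_nvlink(parse_topo(sample)))  # → [('GPU0', 'GPU1')]
```

On this system every GPU pair is expected to be NVLink-connected, so any pair in the result is a candidate for a failed link. On other topologies some pairs may legitimately go over PCIe.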

Is this in a Dell C4130?

What is the output of

nvidia-smi -a

?

Yes, it is a C4130, configuration G, with 4 GPUs internally interconnected by NVLink. Any suggestions? Thanks a lot.

Here is the nvidia-smi -a output.

==============NVSMI LOG==============

Timestamp : Mon Nov 27 13:11:58 2017
Driver Version : 384.90

Attached GPUs : 4
GPU 00000000:04:00.0
Product Name : Tesla P100-SXM2-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217216336
GPU UUID : GPU-c0654425-de20-8455-c301-e8503e61cfe3
Minor Number : 0
VBIOS Version : 86.00.41.00.05
MultiGPU Board : No
Board ID : 0x400
GPU Part Number : 900-2H403-0100-000
Inforom Version
Image Version : H403.0201.00.04
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x15F910DE
Bus Id : 00000000:04:00.0
Sub System Id : 0x116B10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 16276 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 7
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 7
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 1
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 32 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 30.56 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1480 MHz
SM : 1480 MHz
Memory : 715 MHz
Video : 1480 MHz
Max Customer Boost Clocks
Graphics : 1480 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 00000000:06:00.0
Product Name : Tesla P100-SXM2-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321017091950
GPU UUID : GPU-1302e07d-0336-3c7d-bffc-987f65ffe2da
Minor Number : 1
VBIOS Version : 86.00.41.00.05
MultiGPU Board : No
Board ID : 0x600
GPU Part Number : 900-2H403-0100-000
Inforom Version
Image Version : H403.0201.00.04
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x06
Device : 0x00
Domain : 0x0000
Device Id : 0x15F910DE
Bus Id : 00000000:06:00.0
Sub System Id : 0x116B10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 16276 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 30.06 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1480 MHz
SM : 1480 MHz
Memory : 715 MHz
Video : 1480 MHz
Max Customer Boost Clocks
Graphics : 1480 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 00000000:07:00.0
Product Name : Tesla P100-SXM2-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217216332
GPU UUID : GPU-f879fe4a-4879-b7f1-a212-64caf962d15b
Minor Number : 2
VBIOS Version : 86.00.41.00.05
MultiGPU Board : No
Board ID : 0x700
GPU Part Number : 900-2H403-0100-000
Inforom Version
Image Version : H403.0201.00.04
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x07
Device : 0x00
Domain : 0x0000
Device Id : 0x15F910DE
Bus Id : 00000000:07:00.0
Sub System Id : 0x116B10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 16276 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 29 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 32.02 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1480 MHz
SM : 1480 MHz
Memory : 715 MHz
Video : 1480 MHz
Max Customer Boost Clocks
Graphics : 1480 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 00000000:08:00.0
Product Name : Tesla P100-SXM2-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321217216379
GPU UUID : GPU-60f8bcbd-2f6f-8e93-34ee-e837ad543b04
Minor Number : 3
VBIOS Version : 86.00.41.00.05
MultiGPU Board : No
Board ID : 0x800
GPU Part Number : 900-2H403-0100-000
Inforom Version
Image Version : H403.0201.00.04
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x08
Device : 0x00
Domain : 0x0000
Device Id : 0x15F910DE
Bus Id : 00000000:08:00.0
Sub System Id : 0x116B10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 16276 MiB
Used : 0 MiB
Free : 16276 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 31 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 32.02 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1328 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1480 MHz
SM : 1480 MHz
Memory : 715 MHz
Video : 1480 MHz
Max Customer Boost Clocks
Graphics : 1480 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

Hi,
Any update? Thanks.

I didn’t spot anything. You may want to discuss it with Dell. If it were me, before I did that, I would make sure I had the latest system BIOS installed on the machine, and also possibly reload the OS and system software to make sure nothing was corrupted there.