Hi,
I have a system with four NVLink-connected P100 GPUs. I don't know when or how it started, but the driver reports an NVLink error (Xid 74) even on a freshly rebooted system with no workload running. As a result, GPU0 and GPU1, which are supposed to be interconnected via NVLink, have fallen back to PCIe.
Can anybody help? What is the root cause? Does this indicate a hardware fault or just a software/driver issue? Is there any way to fix it? Many thanks.
Env:
GPU: 4x P100-SXM2 16GB (see the topology below). The issue occurs on NVLink 3, which connects GPU0 and GPU1.
OS: Ubuntu Linux 16.04, kernel 4.4.0-98-generic
Driver: 384.90 (most recent), CUDA 8.0
VBIOS: P100_PCN204260.bin (most recent)
After a reboot with nothing running, dmesg reports the NVLink error below. Xid 74 means an NVLink hardware/driver/bus error.
[ 6.270401] NVRM: GPU at PCI:0000:04:00: GPU-c0654425-de20-8455-c301-e8503e61cfe3
[ 6.270417] NVRM: GPU Board Serial Number: 0321217216336
[ 6.270420] NVRM: Xid (PCI:0000:04:00): 74, NVLink: fatal error detected on link 3(0x0, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) <<<====
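To make the log line above easier to track across reboots, here is a minimal Python sketch that pulls the PCI bus ID, Xid code, and link number out of an NVRM Xid line. The regular expressions are assumptions based only on the line format shown in this post, and `parse_xid_line` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed format, taken from the dmesg excerpt in this post:
#   NVRM: Xid (PCI:<bus>): <code>, <message>
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
# The link number appears as "link <n>" inside the message text.
LINK_RE = re.compile(r"\blink (\d+)")


def parse_xid_line(line):
    """Return a dict with bus, xid, link and message, or None if no Xid is found."""
    m = XID_RE.search(line)
    if m is None:
        return None
    msg = m.group("msg").strip()
    link = LINK_RE.search(msg)
    return {
        "bus": m.group("bus"),
        "xid": int(m.group("xid")),
        "link": int(link.group(1)) if link else None,
        "message": msg,
    }


# The exact line reported by dmesg on this system.
sample = (
    "[ 6.270420] NVRM: Xid (PCI:0000:04:00): 74, NVLink: fatal error detected "
    "on link 3(0x0, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"
)
print(parse_xid_line(sample))
```

Feeding it each line of saved `dmesg` output and keeping the non-`None` results would give a quick list of which GPU and link report the error after each boot.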
frank@T4130:~$ nvidia-smi
Thu Nov 23 17:00:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000000:04:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 31C P0 39W / 300W | 0MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 29C P0 41W / 300W | 0MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... Off | 00000000:08:00.0 Off | 0 |
| N/A 31C P0 37W / 300W | 0MiB / 16276MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
frank@T4130:~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
GPU0 X PIX NV1 NV2 PIX 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1 PIX X NV2 NV1 PIX 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2 NV1 NV2 X NV1 PIX 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU3 NV2 NV1 NV1 X PIX 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
mlx5_0 PIX PIX PIX PIX X
The connection between GPU0 and GPU1 should be NVLink, but it is now reported as PIX (PCIe switch).
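The same check can be done automatically. The following is a rough Python sketch that scans saved `nvidia-smi topo -m` output and lists GPU pairs whose connection type does not start with `NV`. It assumes the whitespace-separated layout shown above, and `non_nvlink_gpu_pairs` is a hypothetical helper name. The sample text reuses the matrix from this post with the CPU affinity values left out for brevity.

```python
def non_nvlink_gpu_pairs(topo_text):
    """Return (gpu_a, gpu_b, link_type) for GPU pairs not connected by NVLink."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # Header: device names up to the "CPU Affinity" column.
    cols = []
    for tok in lines[0].split():
        if tok == "CPU":
            break
        cols.append(tok)
    pairs = []
    for ln in lines[1:]:
        toks = ln.split()
        row = toks[0]
        if row not in cols:
            continue  # skip legend or other non-matrix lines
        cells = toks[1:1 + len(cols)]
        for col, cell in zip(cols, cells):
            both_gpus = row.startswith("GPU") and col.startswith("GPU")
            # row < col reports each pair once and skips the "X" diagonal.
            if both_gpus and row < col and not cell.startswith("NV"):
                pairs.append((row, col, cell))
    return pairs


# Matrix from this post; CPU affinity values omitted for brevity.
sample_topo = """\
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
GPU0 X PIX NV1 NV2 PIX
GPU1 PIX X NV2 NV1 PIX
GPU2 NV1 NV2 X NV1 PIX
GPU3 NV2 NV1 NV1 X PIX
mlx5_0 PIX PIX PIX PIX X
"""
print(non_nvlink_gpu_pairs(sample_topo))
```

On a healthy system of this type I would expect the list to be empty. Here it should flag only the GPU0/GPU1 pair.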