Only two of our several HGX 8-GPU A100 (80 GB) NVLink systems show NVLink fatal errors, each preceded by an NVSwitch temperature error, while the GPUs are under load (syslog excerpt further below).
Rebooting the system brings it back to normal, but the same symptoms keep recurring.
I have been looking for a fix on the user-code side and in several other ways, but nothing has worked so far.
Sometimes anywhere from two to six of the eight GPUs show up in an error state in nvidia-smi.
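For reference, the GPU-side state can be polled with the NVML Python bindings (pynvml / nvidia-ml-py); the sketch below is just a minimal probe, nothing specific to this system. NVML does not expose the NVSwitch TSENSE temperature that shows up in the syslog (that comes from the fabric manager side), so this only reads GPU temperatures plus the per-link NVLink state and error counters.

# Minimal GPU-side health probe (sketch; pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: {temp} C")
        # A100 exposes up to 12 NVLink links; iterate the NVML maximum and
        # skip link indices that are not populated on this board.
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) != pynvml.NVML_FEATURE_ENABLED:
                    continue
                crc = pynvml.nvmlDeviceGetNvLinkErrorCounter(
                    handle, link, pynvml.NVML_NVLINK_ERROR_DL_CRC_FLIT)
                replay = pynvml.nvmlDeviceGetNvLinkErrorCounter(
                    handle, link, pynvml.NVML_NVLINK_ERROR_DL_REPLAY)
                if crc or replay:
                    print(f"  link {link}: CRC flit errors={crc}, replays={replay}")
            except pynvml.NVMLError:
                continue  # link index not supported/populated on this GPU
finally:
    pynvml.nvmlShutdown()

Here is the syslog from one occurrence: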
Jul 9 20:29:50 gpu-a-091kernel: [100655.767682] nvidia-nvswitch4: SXid (PCI:0000:89:00.0): 10004, NVSWITCH Temperature 102C | TSENSE WARN Threshold 102C
Jul 9 20:29:51 gpu-a-091kernel: [100656.770169] nvidia-nvswitch4: SXid (PCI:0000:89:00.0): 10005, NVSWITCH Temperature 78C | TSENSE WARN Threshold 102C
Jul 9 20:32:05 gpu-a-091kernel: [100791.169090] nvidia-nvswitch4: SXid (PCI:0000:89:00.0): 10004, NVSWITCH Temperature 102C | TSENSE WARN Threshold 102C
Jul 9 20:32:05 gpu-a-091nv-fabricmanager: detected NVSwitch fatal error 14017 on fid 0 on NVSwitch pci bus id 00000000:89:00.0 physical id 12 port 26
Jul 9 20:32:05 gpu-a-091nv-fabricmanager: a fatal error occurred on NVSwitch port(s) 26 on NVSwitch fid 0 physical id: 12 pci bus id: 00000000:89:00.0 and requires corresponding ports reset to recover.
Jul 9 20:32:05 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 5 pci bus id 00000000:BD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:05 gpu-a-091kernel: [100791.260904] NVRM: GPU at PCI:0000:bd:00: GPU-799ff1ff-5bbf-1929-8d8f-17df1adcdde0
Jul 9 20:32:05 gpu-a-091kernel: [100791.260908] NVRM: GPU Board Serial Number: 1
Jul 9 20:32:05 gpu-a-091kernel: [100791.260909] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 00000008
Jul 9 20:32:05 gpu-a-091kernel: [100791.305981] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 00000009
Jul 9 20:32:05 gpu-a-091kernel: [100791.306561] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000a
Jul 9 20:32:05 gpu-a-091kernel: [100791.307122] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000b
Jul 9 20:32:05 gpu-a-091kernel: [100791.307682] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000c
Jul 9 20:32:05 gpu-a-091kernel: [100791.308244] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000d
Jul 9 20:32:05 gpu-a-091kernel: [100791.308805] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000e
Jul 9 20:32:05 gpu-a-091kernel: [100791.309369] NVRM: Xid (PCI:0000:bd:00): 45, pid=67895, name=python, Ch 0000000f
Jul 9 20:32:05 gpu-a-091kernel: [100791.310675] NVRM: Xid (PCI:0000:bd:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 11(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
Jul 9 20:32:06 gpu-a-091kernel: [100792.171582] nvidia-nvswitch4: SXid (PCI:0000:89:00.0): 10005, NVSWITCH Temperature 76C | TSENSE WARN Threshold 102C
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: detected NVSwitch fatal error 24007 on fid 0 on NVSwitch pci bus id 00000000:89:00.0 physical id 12 port 28
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: a fatal error occurred on NVSwitch port(s) 26,28 on NVSwitch fid 0 physical id: 12 pci bus id: 00000000:89:00.0 and requires corresponding ports reset to recover.
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 5 pci bus id 00000000:BD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 6 pci bus id 00000000:CD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:13 gpu-a-091kernel: [100799.296886] NVRM: GPU at PCI:0000:cd:00: GPU-fbd03430-eb9f-11eb-1beb-64f57ad3bb04
Jul 9 20:32:13 gpu-a-091kernel: [100799.296891] NVRM: GPU Board Serial Number: 1
Jul 9 20:32:13 gpu-a-091kernel: [100799.296891] NVRM: Xid (PCI:0000:cd:00): 62, pid='<unknown>', name=<unknown>, 000026f4 00002ab8 00001126 0000117a 0000274f 00029698 00000011 00000000
Jul 9 20:32:13 gpu-a-091kernel: [100799.297442] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 00000008
Jul 9 20:32:13 gpu-a-091kernel: [100799.344696] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 00000009
Jul 9 20:32:13 gpu-a-091kernel: [100799.347122] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000a
Jul 9 20:32:13 gpu-a-091kernel: [100799.349536] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000b
Jul 9 20:32:13 gpu-a-091kernel: [100799.351945] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000c
Jul 9 20:32:13 gpu-a-091kernel: [100799.354332] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000d
Jul 9 20:32:13 gpu-a-091kernel: [100799.356727] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000e
Jul 9 20:32:13 gpu-a-091kernel: [100799.359133] NVRM: Xid (PCI:0000:cd:00): 45, pid=67896, name=python, Ch 0000000f
Jul 9 20:32:13 gpu-a-091kernel: [100799.362186] NVRM: Xid (PCI:0000:cd:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 11(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: detected NVSwitch fatal error 24007 on fid 0 on NVSwitch pci bus id 00000000:89:00.0 physical id 12 port 11
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: a fatal error occurred on NVSwitch port(s) 11,26,28 on NVSwitch fid 0 physical id: 12 pci bus id: 00000000:89:00.0 and requires corresponding ports reset to recover.
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 4 pci bus id 00000000:9D:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 5 pci bus id 00000000:BD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:13 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 6 pci bus id 00000000:CD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:13 gpu-a-091kernel: [100799.387147] NVRM: GPU at PCI:0000:9d:00: GPU-7132a0f7-3be0-4cdd-7846-654594dcdf71
Jul 9 20:32:13 gpu-a-091kernel: [100799.387151] NVRM: GPU Board Serial Number: 1
Jul 9 20:32:13 gpu-a-091kernel: [100799.387152] NVRM: Xid (PCI:0000:9d:00): 62, pid='<unknown>', name=<unknown>, 000026f4 00002ab8 00001126 0000117a 0000274f 00029698 00000011 00000000
Jul 9 20:32:13 gpu-a-091kernel: [100799.387696] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 00000008
Jul 9 20:32:13 gpu-a-091kernel: [100799.434242] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 00000009
Jul 9 20:32:13 gpu-a-091kernel: [100799.436701] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000a
Jul 9 20:32:13 gpu-a-091kernel: [100799.439145] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000b
Jul 9 20:32:13 gpu-a-091kernel: [100799.441578] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000c
Jul 9 20:32:13 gpu-a-091kernel: [100799.444014] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000d
Jul 9 20:32:13 gpu-a-091kernel: [100799.446447] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000e
Jul 9 20:32:13 gpu-a-091kernel: [100799.448888] NVRM: Xid (PCI:0000:9d:00): 45, pid=67894, name=python, Ch 0000000f
Jul 9 20:32:13 gpu-a-091kernel: [100799.452109] NVRM: Xid (PCI:0000:9d:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 11(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: detected NVSwitch fatal error 24007 on fid 0 on NVSwitch pci bus id 00000000:89:00.0 physical id 12 port 25
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: a fatal error occurred on NVSwitch port(s) 11,25,26,28 on NVSwitch fid 0 physical id: 12 pci bus id: 00000000:89:00.0 and requires corresponding ports reset to recover.
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 4 pci bus id 00000000:9D:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 5 pci bus id 00000000:BD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 6 pci bus id 00000000:CD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:21 gpu-a-091nv-fabricmanager: NVSwitch port connected to GPU fid 0 index 7 pci bus id 00000000:DD:00.0 experienced an NVLink fatal error and requires port reset to recover. All the running CUDA jobs on this GPU will be affected.#012Resetting the specified GPU may clear the issue. Please refer to your system user guide for GPU reset instructions.
Jul 9 20:32:21 gpu-a-091kernel: [100807.419364] NVRM: GPU at PCI:0000:dd:00: GPU-74a765b4-5575-d476-3c29-aa7e50ba72f5
Jul 9 20:32:21 gpu-a-091kernel: [100807.419368] NVRM: GPU Board Serial Number: 1
Jul 9 20:32:21 gpu-a-091kernel: [100807.419369] NVRM: Xid (PCI:0000:dd:00): 62, pid='<unknown>', name=<unknown>, 000026f4 00002ab8 00001126 0000117a 0000274f 00029698 00000011 00000000
Jul 9 20:32:51 gpu-a-091kernel: [100837.416540] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 00000008
Jul 9 20:32:51 gpu-a-091kernel: [100837.463045] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 00000009
Jul 9 20:32:51 gpu-a-091kernel: [100837.465460] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000a
Jul 9 20:32:51 gpu-a-091kernel: [100837.467850] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000b
Jul 9 20:32:51 gpu-a-091kernel: [100837.470237] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000c
Jul 9 20:32:51 gpu-a-091kernel: [100837.472643] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000d
Jul 9 20:32:51 gpu-a-091kernel: [100837.475034] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000e
Jul 9 20:32:51 gpu-a-091kernel: [100837.477424] NVRM: Xid (PCI:0000:dd:00): 45, pid=67897, name=python, Ch 0000000f
Jul 9 20:32:51 gpu-a-091kernel: [100837.480448] NVRM: Xid (PCI:0000:dd:00): 74, pid='<unknown>', name=<unknown>, NVLink: fatal error detected on link 11(0x0, 0x0, 0x10000, 0x0, 0x0, 0x0, 0x0)
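The fabric manager messages identify the affected GPUs only by fabric index and PCI bus id, so to match them to the indices and UUIDs that nvidia-smi and the NVRM lines use (e.g. 00000000:BD:00.0 above is the GPU logged as GPU-799ff1ff-…), a plain NVML listing like the sketch below is enough; again this is just standard pynvml, nothing system-specific.

# List index / PCI bus id / UUID for every GPU so the fabric manager and
# NVRM messages above can be matched to nvidia-smi indices.
import pynvml

def _s(x):
    # pynvml returns bytes on some versions and str on others
    return x.decode() if isinstance(x, bytes) else x

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        pci = pynvml.nvmlDeviceGetPciInfo(handle)   # busId looks like 00000000:BD:00.0
        uuid = pynvml.nvmlDeviceGetUUID(handle)     # looks like GPU-799ff1ff-...
        print(f"GPU {i}: busId={_s(pci.busId)}, uuid={_s(uuid)}")
finally:
    pynvml.nvmlShutdown()

As far as I know, the per-GPU reset the fabric manager hints at would then be nvidia-smi --gpu-reset -i <index> with the GPU idle, but so far only a full reboot has brought these systems back to normal.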