NVLink errors

agiordan · July 29, 2019, 2:55pm

I’m seeing this in logs:

[Thu Jul 25 01:48:20 2019][1089357.164420] NVRM: Xid (PCI:0003:01:00): 74, NVLink: failed to train link 0 to remote PCI:0008:00:00
[Thu Jul 25 01:48:20 2019][1089357.164496] NVRM: Xid (PCI:0006:01:00): 74, NVLink: failed to train link 2 to remote PCI:0009:00:01

Has anyone seen this before? Any idea how to resolve it? Thanks.

Robert_Crovella · July 29, 2019, 2:59pm

It may have been an intermittent error, which can be difficult to diagnose. If it is a persistent error (e.g. it happens every time you boot), it most likely indicates a hardware issue. It is difficult to say anything else without knowing more about your setup.

agiordan · July 29, 2019, 3:09pm

It’s happening on all 62 nodes in the cluster: POWER8 nodes, RHEL 7.6, CUDA 10.1. So I don’t think it’s hardware related. The message also repeats constantly, filling up the logs. The nodes are diskless.

Robert_Crovella · July 29, 2019, 3:16pm

I would recommend seeking support from the system vendor, and/or IBM. IBM can/will enlist the support of NVIDIA as needed.
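
When collecting information for a vendor support case, it can help to summarize which GPU/link pairs are reporting the failure on each node. Below is a minimal Python sketch that tallies the "failed to train link" messages by GPU, link number, and remote device. The regular expression is an assumption inferred only from the two log lines quoted in this thread, and `tally_link_failures` is a hypothetical helper name; other driver versions may format Xid messages differently.

```python
import re
from collections import Counter

# Pattern inferred from the two log lines quoted in this thread (assumption);
# it only matches Xid 74 "failed to train link" messages.
XID74_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<gpu>[0-9a-fA-F:.]+)\): 74, "
    r"NVLink: failed to train link (?P<link>\d+) "
    r"to remote PCI:(?P<remote>[0-9a-fA-F:.]+)"
)


def tally_link_failures(lines):
    """Count matching messages per (gpu, link, remote) triple."""
    counts = Counter()
    for line in lines:
        m = XID74_RE.search(line)
        if m:
            key = (m.group("gpu"), int(m.group("link")), m.group("remote"))
            counts[key] += 1
    return counts


# The two lines from the original post, used here as sample input.
sample = [
    "[Thu Jul 25 01:48:20 2019][1089357.164420] NVRM: Xid (PCI:0003:01:00): 74, "
    "NVLink: failed to train link 0 to remote PCI:0008:00:00",
    "[Thu Jul 25 01:48:20 2019][1089357.164496] NVRM: Xid (PCI:0006:01:00): 74, "
    "NVLink: failed to train link 2 to remote PCI:0009:00:01",
]

for (gpu, link, remote), n in sorted(tally_link_failures(sample).items()):
    print(f"GPU {gpu} link {link} -> remote {remote}: {n} failure(s)")
```

Running the same tally against the kernel log of each node would show whether every node reports the same GPU/link pairs, which is the kind of detail a vendor is likely to ask for.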
