I’m seeing this in logs:
[Thu Jul 25 01:48:20 2019][1089357.164420] NVRM: Xid (PCI:0003:01:00): 74, NVLink: failed to train link 0 to remote PCI:0008:00:00
[Thu Jul 25 01:48:20 2019][1089357.164496] NVRM: Xid (PCI:0006:01:00): 74, NVLink: failed to train link 2 to remote PCI:0009:00:01
Has anyone seen this and knows how to resolve it? Thanks.
It may have been an intermittent error, which can be difficult to diagnose. If it is a persistent error (e.g. it happens every time you boot), it most likely indicates a hardware issue. It is difficult to say anything more without knowing about your setup.
It’s happening on all 62 nodes in the cluster. They are POWER8 nodes running RHEL 7.6 and CUDA 10.1, so I don’t think it’s hardware related. The message also repeats constantly, filling up the logs. The nodes are diskless.
I would recommend seeking support from the system vendor and/or IBM. IBM can and will enlist the support of NVIDIA as needed.
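
When preparing a support request like the one suggested above, it may help to summarize which GPU, link, and remote endpoint each Xid 74 message names, and how often each combination repeats. The following is only a minimal sketch: the regular expression is written against the two log lines quoted in this thread (other driver versions may format the message differently), and the function name `summarize_xid74` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the Xid 74 lines quoted in this thread.
# Assumption: other driver versions may word the message differently.
XID74 = re.compile(
    r"NVRM: Xid \(PCI:(?P<gpu>[0-9a-fA-F:.]+)\): 74, "
    r"NVLink: failed to train link (?P<link>\d+) "
    r"to remote PCI:(?P<remote>[0-9a-fA-F:.]+)"
)


def summarize_xid74(lines):
    """Count occurrences of each (GPU, link, remote) triple in the log lines."""
    counts = Counter()
    for line in lines:
        match = XID74.search(line)
        if match:
            key = (match.group("gpu"), int(match.group("link")), match.group("remote"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Read kernel log text from standard input and print one line per triple.
    for (gpu, link, remote), n in sorted(summarize_xid74(sys.stdin).items()):
        print(f"{gpu} link {link} -> {remote}: {n}")
```

Saved under a name of your choosing (for example `xid74_summary.py`), it could be run on each node as `dmesg | python3 xid74_summary.py`. Comparing the output across nodes shows whether the same links fail everywhere, which is the kind of detail a vendor is likely to ask for.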