Nvidia Driver not loaded

Hello,

We are facing an issue with our GPU nodes. Can you please help us resolve it?

RHEL 8.1
Kernel 4.18.0-147.8.1.el8_1.x86_64
Nvidia 440.64.00
Cuda 10.2

Node : HPE XA780I SINGLE 2-SOCKET INTEL XEON SCALABLE W/4 SXM2 GPUS NVLINKED NODE BLADE
GPU : Nvidia Tesla V100

It started with an issue reported by nvidia-smi:

[root@r10i1n6 ~]# nvidia-smi
Thu Sep 3 08:28:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |      3MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   43C    P0   ERR! / 300W |      3MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:88:00.0 Off |                    0 |
| N/A   43C    P0   ERR! / 300W |      3MiB / 16160MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   44C    P0    46W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

We see ERR! in the power usage column for GPU 1 and GPU 2. The DCGM health check reports the following:
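For reference, a minimal sketch of how the GPUs with an unreadable power value could be listed without reading the table by eye. The function name `flag_power_errors` is hypothetical, and it assumes that a failed reading appears as non-numeric text in the CSV output; the exact placeholder string was not captured on this node.

```shell
#!/usr/bin/env bash
# Sketch: list GPUs whose power draw is not a plain number.
# Input on stdin: lines of "index, power.draw", as produced by
#   nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits
# Assumption: a failed reading shows up as non-numeric text.
flag_power_errors() {
    awk -F', *' '$2 !~ /^[0-9]+(\.[0-9]+)?$/ {
        print "GPU " $1 ": power reading unavailable (" $2 ")"
    }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits \
        | flag_power_errors
fi
```

On a healthy node this should print nothing; any line printed names a GPU index to look at more closely.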

+---------------------------+----------------------------------------------------------+
| Health Monitor Report                                                                |
+===========================+==========================================================+
| Overall Health            | Failure                                                  |
| GPU                       |                                                          |
| -> 1                      | Failure                                                  |
|    -> Errors              |                                                          |
|       -> NVLINK system    | Failure                                                  |
|                           | GPU 1's NvLink link 1 is currently down Run a field      |
|                           | diagnostic on the GPU.                                   |
|       -> Power system     | Warning                                                  |
|                           | Cannot reliably read the power usage for GPU 1.          |
| -> 2                      | Failure                                                  |
|    -> Errors              |                                                          |
|       -> NVLINK system    | Failure                                                  |
|                           | GPU 2's NvLink link 1 is currently down Run a field      |
|                           | diagnostic on the GPU.                                   |
|       -> Power system     | Warning                                                  |
|                           | Cannot reliably read the power usage for GPU 2.          |

After launching a DCGMI test, the node hung, so I power-cycled it to restart it:

[root@r10i1n6 ~]# dcgmi health -c
+---------------------------+----------------------------------------------------------+
| Health Monitor Report                                                                |
+===========================+==========================================================+
| Overall Health            | Failure                                                  |
| GPU                       |                                                          |
| -> 1                      | Failure                                                  |
|    -> Errors              |                                                          |
|       -> NVLINK system    | Failure                                                  |
|                           | GPU 1's NvLink link 1 is currently down Run a field      |
|                           | diagnostic on the GPU.                                   |
| -> 2                      | Failure                                                  |
|    -> Errors              |                                                          |
|       -> NVLINK system    | Failure                                                  |
|                           | GPU 2's NvLink link 1 is currently down Run a field      |
|                           | diagnostic on the GPU.                                   |
+---------------------------+----------------------------------------------------------+

I then swapped the GPUs as follows:

GPU id0 <-> GPU id1.
GPU id2 <-> GPU id3.

Then I got this issue:

[root@r10i1n6 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the
NVIDIA driver. Make sure that the latest NVIDIA driver is installed
and running.

[root@r10i1n6 ~]# lspci | grep HFI
1b:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
5e:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
89:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

[root@r10i1n6 ~]# opahfirev
######################
r10i1n6 - HFI 0000:1b:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:0 PCISlot:00 NUMANode:0 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:5e:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:0 PCISlot:00 NUMANode:0 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:89:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:1 PCISlot:00 NUMANode:1 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:d8:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:1 PCISlot:00 NUMANode:1 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################

[root@r10i1n6 ~]# lspci | grep -i NV
[root@r10i1n6 ~]#
[root@r10i1n6 ~]# lspci | grep -i PLX
18:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:04.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:08.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
86:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:04.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:08.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:0c.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)

Conclusion:

The NVIDIA GPUs are no longer detected by lspci.
HFI driver loaded.
We rebooted the node.
We reset the GPU: same issue.
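For reference, a minimal sketch of how the "not detected by lspci" check could be made independent of device name strings, by matching NVIDIA's PCI vendor ID (10de) in numeric lspci output. The function name `count_nvidia_pci` is hypothetical, and the sample line in the comment is illustrative rather than copied from this node.

```shell
#!/usr/bin/env bash
# Sketch: count NVIDIA functions in numeric lspci output.
# Input on stdin: output of `lspci -n`, where each line looks like
#   "1a:00.0 0302: 10de:1db1 (rev a1)"   (illustrative sample)
# The third field is vendor:device; 10de is NVIDIA's PCI vendor ID.
count_nvidia_pci() {
    awk '$3 ~ /^10de:/ { n++ } END { print n + 0 }'
}

if command -v lspci >/dev/null 2>&1; then
    found=$(lspci -n | count_nvidia_pci)
    echo "NVIDIA PCI functions visible: $found"
    if [ "$found" -eq 0 ]; then
        # With nothing enumerated on the bus, the driver has no device to bind to.
        echo "No NVIDIA device on the PCI bus"
    fi
fi
```

A count of 0 here would match the empty `lspci | grep -i NV` result above and would point at PCIe enumeration rather than at the driver installation itself.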

We have the same problem on 2 other nodes.

Is this a known issue?

Thank you in advance !