Nvidia Driver not loaded

thameur.mejri · October 23, 2020, 6:33am

Hello,

We are facing an issue with our GPU nodes. Can you please help us resolve it?

RHEL 8.1
Kernel 4.18.0-147.8.1.el8_1.x86_64
Nvidia 440.64.00
Cuda 10.2

Node : HPE XA780I SINGLE 2-SOCKET INTEL XEON SCALABLE W/4 SXM2 GPUS NVLINKED NODE BLADE
GPU : Nvidia Tesla V100

It started with an issue reported by nvidia-smi:

We see ERR! on GPU ID 1 and GPU ID 2.

After launching a DCGMI test, the node hung, so I power-cycled it (OFF/ON) to restart it.

I then swapped the GPUs as follows:

GPU id0 <-> GPU id1.
GPU id2 <-> GPU id3.

Then I got this issue:

[root@r10i1n6 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the
NVIDIA driver. Make sure that the latest NVIDIA driver is installed
and running.

[root@r10i1n6 ~]# lspci | grep HFI
1b:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
5e:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
89:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

[root@r10i1n6 ~]# opahfirev
######################
r10i1n6 - HFI 0000:1b:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:0 PCISlot:00 NUMANode:0 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:5e:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:0 PCISlot:00 NUMANode:0 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:89:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:1 PCISlot:00 NUMANode:1 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################
######################
r10i1n6 - HFI 0000:d8:00.0
HFI: Driver not Loaded
Board: UNKNOWN
SN: UNKNOWN
Location:Discrete Socket:1 PCISlot:00 NUMANode:1 HFI_NA
Bus: Speed 5GT/s, Width x16
GUID: UNKNOWN
SiRev:
TMM: UNKNOWN
######################

[root@r10i1n6 ~]# lspci | grep -i NV
[root@r10i1n6 ~]#
[root@r10i1n6 ~]# lspci | grep -i PLX
18:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:04.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
19:08.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
86:00.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:04.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:08.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
87:0c.0 PCI bridge: PLX Technology, Inc. Device 8764 (rev ab)
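
For reference, `grep -i NV` depends on the device name text that lspci prints. A stricter check, sketched here as a suggestion and not something that was run on these nodes, is to match the numeric PCI vendor ID `10de` (NVIDIA) in `lspci -n` output. The helper name `count_nvidia_functions` and the sample line in the comment are illustrative assumptions.

```shell
# Hypothetical helper: count PCI functions with vendor ID 10de (NVIDIA).
# Reads `lspci -n` output on stdin, where each line has the form
#   "3b:00.0 0302: 10de:1db1 (rev a1)"   (slot, class, vendor:device)
# The line above is an illustrative sample, not output from this node.
count_nvidia_functions() {
    awk '$3 ~ /^10de:/ { n++ } END { print n + 0 }'
}

# Usage on a node:
#   lspci -n | count_nvidia_functions
```

A result of 0 would mean that no NVIDIA function is enumerated on the PCI bus at all, which would be consistent with the empty `grep -i NV` output above.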

Conclusion:

The NVIDIA GPUs are not detected by lspci.
opahfirev reports "Driver not Loaded" for all four HFIs.
We rebooted the node.
We reset the GPUs: same issue.
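
One more check that may help narrow this down, given only as a minimal sketch: confirm whether the `nvidia` kernel module is loaded at all. The helper name `nvidia_module_loaded` is hypothetical, and it assumes the usual `/proc/modules` layout in which the first field of each line is the module name.

```shell
# Hypothetical helper: succeed (exit 0) if the `nvidia` kernel module is listed.
# Reads /proc/modules-style text on stdin; the first field is the module name.
# Only an exact match counts, so nvidia_drm or nvidia_uvm alone do not match.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Usage on a node:
#   nvidia_module_loaded < /proc/modules && echo "nvidia loaded" || echo "nvidia not loaded"
#   dmesg | grep -E 'NVRM|Xid'    # NVIDIA kernel driver messages, if any
```

If the module is missing and no NVIDIA device is visible on the PCI bus, the driver has nothing to bind to, so reinstalling the driver alone is unlikely to change the result.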

We have the same problem on 2 other nodes.

Is this a known issue?

Thank you in advance !
