CentOS 8/Driver 440.33 Tesla V100: nvidia-smi reports error 62

Hello,

I am running CentOS 8 on a DELL C4140 with 4 NVIDIA Tesla V100 GPUs:

# nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-ca51cbd1-4eb0-f265-9cb6-613f848d9ebd)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-d02431f6-3ad3-e257-2858-8e830e06fa56)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-2c34530c-3db6-3cd2-df6b-39f65d3448c2)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-c8740a89-14af-b1f3-d17c-629cb2454737)

The installed NVIDIA driver version 440.33 seems to load fine. The first call of

nvidia-smi

reports the four GPUs as follows:

# nvidia-smi
Tue Dec 17 15:07:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   36C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   32C    P0    54W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   34C    P0    58W / 300W |      0MiB / 32510MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Calling

nvidia-smi

again results in a Linux kernel crash and an immediate reboot of the system.

The error logs contain the following message:

NVRM: GPU at PCI:0000:1e:00: GPU-c8740a89-14af-b1f3-d17c-629cb2454737
NVRM: GPU Board Serial Number: 0324518173584
NVRM: Xid (PCI:0000:1e:00): 62, pid=10230, 0a76(2d50) 00000000 00000000

According to the list of Xid errors (see https://docs.nvidia.com/deploy/xid-errors/index.html), error 62 refers to an internal micro-controller halt (on newer drivers) and can be caused by a hardware error, a driver error, or a thermal issue.
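
For reference, a quick way to look for further Xid messages and to check the thermal and ECC state of each GPU (assuming the standard driver tools are installed) is:

dmesg | grep -i xid

nvidia-smi -q -d TEMPERATURE,ECC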

Tests with previous NVIDIA drivers showed only three GPUs. Still, error 62 is rather unspecific, and I wonder how to find out whether the 4th GPU is broken or whether this is a driver problem.

Running

nvidia-bug-report.sh

only partly worked: the script did not run to completion and, again, triggered an immediate system reboot. Running it with the additional option

--safe-mode

worked fine. I enclose both log files.
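
That is, the invocation that completed was:

nvidia-bug-report.sh --safe-mode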

Any help on that is appreciated.

Thanks in advance,

Frank
nvidia-bug-report.log-20191218-1.gz (59.7 KB)
nvidia-bug-report-safe-mode.log.gz (73.5 KB)

For unknown reasons, the Xserver fails to start, so it is being restarted and stopped in fast succession:

/usr/libexec/gdm-x-session[13357]: Unable to run X server

To check whether an NVIDIA GPU is involved, please disable X, enable nvidia-persistenced to start on boot, and then test the GPUs using e.g. gpu-burn (sketched below).
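
A rough outline of those steps (the one-hour gpu-burn run is just an example; gpu-burn has to be built from its sources, e.g. https://github.com/wilicc/gpu-burn):

Boot to the text console so that X does not start:

systemctl set-default multi-user.target

Start the persistence daemon now and on every boot:

systemctl enable --now nvidia-persistenced

Stress-test all GPUs for one hour:

./gpu_burn 3600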

Hello generix,

thanks for your proposal. Enabling nvidia-persistenced helped to stabilize the system. Now we can run our tests :)

Frank

Not sure what error this is, but I ran into the following problem:

For some reason, one of my GPUs shows the error state ERR!:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    8 |
| N/A   34C    P0   ERR! / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi commands take at least 10 seconds to complete. I rebooted the node; it came back and everything looks OK. Is there any place or information you want me to check right now, after the reboot, or any other information I should gather and send when this happens again?
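
For reference, what I would plan to collect the next time this happens (assuming the usual driver tools are available) is along these lines:

Check the kernel log for Xid/NVRM messages:

dmesg | grep -iE 'xid|nvrm'

Capture the full per-GPU status, including power and ECC counters:

nvidia-smi -q

Generate a complete bug report for further analysis:

nvidia-bug-report.sh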