CentOS 8/Driver 440.33 Tesla V100: nvidia-smi reports error 62

Hello,

I am running CentOS 8 on a DELL C4140 with 4 NVIDIA Tesla V100 GPUs:

# nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-ca51cbd1-4eb0-f265-9cb6-613f848d9ebd)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-d02431f6-3ad3-e257-2858-8e830e06fa56)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-2c34530c-3db6-3cd2-df6b-39f65d3448c2)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-c8740a89-14af-b1f3-d17c-629cb2454737)

The installed NVIDIA driver version 440.33 seems to load fine. The first call of

nvidia-smi

reports the four GPUs as follows:

# nvidia-smi
Tue Dec 17 15:07:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   36C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   32C    P0    54W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   34C    P0    58W / 300W |      0MiB / 32510MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Calling

nvidia-smi

again results in a Linux kernel crash and an immediate reboot of the system.

The error logs contain the following message:

NVRM: GPU at PCI:0000:1e:00: GPU-c8740a89-14af-b1f3-d17c-629cb2454737
NVRM: GPU Board Serial Number: 0324518173584
NVRM: Xid (PCI:0000:1e:00): 62, pid=10230, 0a76(2d50) 00000000 00000000

According to the list of Xid errors (see https://docs.nvidia.com/deploy/xid-errors/index.html), error 62 refers to an internal micro-controller halt (on newer drivers) and can be caused by a hardware error, a driver error, or a thermal issue.
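
For reference, a quick way to look for further Xid messages and to check the thermal and ECC state of each GPU (assuming the standard driver tools are installed) is:

dmesg | grep -i xid

nvidia-smi -q -d TEMPERATURE,ECC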

Tests with previous NVIDIA drivers showed only three GPUs. Still, error 62 is rather unspecific, and I wonder how to find out whether the 4th GPU is broken or whether this is a driver problem.

Running

nvidia-bug-report.sh

only partly worked: the script did not run to completion and, again, triggered an immediate system reboot. Running it with the additional option

--safe-mode

worked fine. I enclose both log files.
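
That is, the invocation that completed was:

nvidia-bug-report.sh --safe-mode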

Any help on that is appreciated.

Thanks in advance,

Frank
nvidia-bug-report.log-20191218-1.gz (59.7 KB)
nvidia-bug-report-safe-mode.log.gz (73.5 KB)

For unknown reasons, the Xserver fails to start, so it is being restarted and stopped in fast succession:

/usr/libexec/gdm-x-session[13357]: Unable to run X server

To check whether an NVIDIA GPU is involved, please disable X, enable nvidia-persistenced to start on boot, and then test the GPUs using e.g. gpu-burn (sketched below).
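
A rough outline of those steps (the one-hour gpu-burn run is just an example; gpu-burn has to be built from its sources, e.g. https://github.com/wilicc/gpu-burn):

Boot to the text console so that X does not start:

systemctl set-default multi-user.target

Start the persistence daemon now and on every boot:

systemctl enable --now nvidia-persistenced

Stress-test all GPUs for one hour:

./gpu_burn 3600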

Hello generix,

thanks for your proposal. Enabling nvidia-persistenced helped to stabilize the system. Now we can run our tests :)

Frank

Not sure what error this is, but I ran into the following problem:

For some reason, one of my GPUs shows the error state ERR!:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    8 |
| N/A   34C    P0   ERR! / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi commands take at least 10 seconds to complete. I rebooted the node; it came back and everything looks OK. Is there any place or information you want me to check right now, after the reboot, or any other information I should gather and send when this happens again?
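
For reference, what I would plan to collect the next time this happens (assuming the usual driver tools are available) is along these lines:

Check the kernel log for Xid/NVRM messages:

dmesg | grep -iE 'xid|nvrm'

Capture the full per-GPU status, including power and ECC counters:

nvidia-smi -q

Generate a complete bug report for further analysis:

nvidia-bug-report.sh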