Hello,
I am running CentOS 8 on a Dell C4140 with four NVIDIA Tesla V100 GPUs:
# nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-ca51cbd1-4eb0-f265-9cb6-613f848d9ebd)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-d02431f6-3ad3-e257-2858-8e830e06fa56)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-2c34530c-3db6-3cd2-df6b-39f65d3448c2)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-c8740a89-14af-b1f3-d17c-629cb2454737)
The installed NVIDIA driver, version 440.33.01, seems to load fine. The first call of nvidia-smi reports the four GPUs as follows:
# nvidia-smi
Tue Dec 17 15:07:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1A:00.0 Off | 0 |
| N/A 36C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:1C:00.0 Off | 0 |
| N/A 34C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:1D:00.0 Off | 0 |
| N/A 32C P0 54W / 300W | 0MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:1E:00.0 Off | 0 |
| N/A 34C P0 58W / 300W | 0MiB / 32510MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Calling nvidia-smi a second time results in a Linux kernel crash and an immediate reboot of the system. The error logs contain the following messages:
NVRM: GPU at PCI:0000:1e:00: GPU-c8740a89-14af-b1f3-d17c-629cb2454737
NVRM: GPU Board Serial Number: 0324518173584
NVRM: Xid (PCI:0000:1e:00): 62, pid=10230, 0a76(2d50) 00000000 00000000
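(For reference: these NVRM lines come from the system logs after the reboot; on CentOS 8 they should also be retrievable from the previous boot's kernel messages with something like

# journalctl -k -b -1 | grep NVRM

assuming persistent journald storage is enabled.)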
According to the list of Xid errors (see https://docs.nvidia.com/deploy/xid-errors/index.html), error 62 refers to an internal micro-controller halt (newer drivers) and can be caused by a hardware error, a driver error, or a thermal issue.
Tests with previous NVIDIA drivers showed only three GPUs. Still, error 62 is rather unspecific, and I wonder how to find out whether the 4th GPU (at PCI 1e:00.0) is broken or whether this is a driver problem.
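The only isolation steps I could think of so far are the following (just a sketch, not yet tried). First, query only the suspect GPU and see whether that alone triggers the crash:

# nvidia-smi -q -i 3 -d TEMPERATURE,ECC

Second, restrict a small CUDA test program to that single device via its UUID (if I remember correctly, CUDA_VISIBLE_DEVICES accepts UUIDs; ./some_cuda_test is a placeholder for any CUDA sample):

# CUDA_VISIBLE_DEVICES=GPU-c8740a89-14af-b1f3-d17c-629cb2454737 ./some_cuda_test

I also noticed that persistence mode is off. Since the crash happens on the second nvidia-smi call, i.e. presumably when the driver re-initializes the idle GPUs, I wonder whether enabling it with

# nvidia-smi -pm 1

would change the behavior, or merely mask the problem.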
Running nvidia-bug-report.sh only partly worked: the script did not run to completion and again triggered an immediate system reboot. With the additional option --safe-mode, it completed fine. I enclose both log files.
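For clarity, the safe-mode log was generated with the following invocation:

# nvidia-bug-report.sh --safe-mode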
Any help on that is appreciated.
Thanks in advance,
Frank
nvidia-bug-report.log-20191218-1.gz (59.7 KB)
nvidia-bug-report-safe-mode.log.gz (73.5 KB)