I have four NVIDIA A10 24GB cards installed in a machine running Ubuntu 18.04.5 LTS with kernel 4.15.0-128-generic. After a reboot, one of the cards no longer appears in nvidia-smi, although lspci still detects it. Repeated reboots do not solve the problem. I have collected the nvidia-bug-report:
nvidia-bug-report.log.gz (1.5 MB)
I don’t know how to fix this. If anyone can help, I’d really appreciate it.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:17:00.0 Off |                    0 |
|  0%   39C    P8     9W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:25:00.0 Off |                    0 |
|  0%   40C    P8     9W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10          On   | 00000000:D9:00.0 Off |                    0 |
|  0%   39C    P8     9W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
sudo lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2236 (rev a1)
25:00.0 3D controller: NVIDIA Corporation Device 2236 (rev a1)
c5:00.0 3D controller: NVIDIA Corporation Device 2236 (rev ff)
d9:00.0 3D controller: NVIDIA Corporation Device 2236 (rev a1)
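
Comparing the two listings above by hand, the card at c5:00.0 is the one missing from nvidia-smi, and it is also the only one lspci reports as rev ff. A minimal sketch of how that comparison could be scripted is below. The function names are hypothetical helpers, and it assumes that lspci prints bus IDs without the PCI domain (as in the output above) and that `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` prints one ID per line with an 8-digit domain prefix.

```shell
# Hypothetical helper: turn nvidia-smi bus IDs such as 00000000:D9:00.0
# into the short lowercase form that lspci prints (d9:00.0).
normalize_smi_ids() {
    sed -E 's/^[0-9A-Fa-f]{8}://' | tr 'A-F' 'a-f'
}

# Hypothetical helper: print bus IDs that lspci lists but nvidia-smi does not.
#   $1 = text of `lspci | grep -i nvidia`
#   $2 = text of `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`
missing_gpus() {
    smi_ids=$(printf '%s\n' "$2" | normalize_smi_ids)
    # First field of each lspci line is the bus ID; drop every ID that
    # exactly matches one of the (newline-separated) nvidia-smi IDs.
    printf '%s\n' "$1" | awk '{print $1}' | grep -vxF -e "$smi_ids"
}

# Example invocation on the affected machine:
#   missing_gpus "$(lspci | grep -i nvidia)" \
#                "$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)"
```

The functions take the command output as text rather than running the commands themselves, so the same check can be run against saved output such as the listings in this post.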