We have about seven SuperMicro servers, each with eight GeForce RTX 2080 Ti GPUs. On some of them, a GPU has started disappearing from nvidia-smi for no apparent reason. Running lshw -c display still lists all of the GPUs, so the cards are recognized by the OS but not by nvidia-smi, which shows only seven of the eight:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:1A:00.0 Off |                  N/A |
| 36%   35C    P8     8W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3D:00.0 Off |                  N/A |
| 31%   33C    P8    19W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:3E:00.0 Off |                  N/A |
| 30%   35C    P8     4W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:88:00.0 Off |                  N/A |
| 31%   32C    P8    11W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:89:00.0 Off |                  N/A |
| 31%   33C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:B1:00.0 Off |                  N/A |
| 30%   33C    P8    14W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:B2:00.0 Off |                  N/A |
| 31%   34C    P8     8W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
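To pin down which card is missing, the bus IDs the OS sees can be compared with the ones nvidia-smi reports. The following is only a rough sketch, not a tested tool: it assumes lspci is installed, that filtering on NVIDIA's PCI vendor ID (10de) and the "VGA compatible controller" / "3D controller" class strings catches every GPU, and the helper names (normalize_bus_id, missing_gpus, and so on) are made up for illustration.

```python
import re
import subprocess

# Matches lines such as "0000:1a:00.0 VGA compatible controller: NVIDIA ..."
# (lspci -D prints the PCI domain in front of bus:device.function).
LSPCI_GPU = re.compile(
    r"^([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"
    r"(VGA compatible controller|3D controller)"
)


def normalize_bus_id(bus_id):
    """Bring '00000000:1A:00.0' (nvidia-smi) and '0000:1a:00.0' (lspci) to one form."""
    domain, bus, dev_fn = bus_id.strip().lower().split(":")
    return f"{domain[-4:]}:{bus}:{dev_fn}"


def gpus_from_lspci(text):
    """Collect bus IDs of GPU functions only, skipping audio/USB functions."""
    ids = set()
    for line in text.splitlines():
        match = LSPCI_GPU.match(line)
        if match:
            ids.add(normalize_bus_id(match.group(1)))
    return ids


def gpus_from_nvidia_smi(text):
    """Parse one bus ID per line, as printed with --format=csv,noheader."""
    return {normalize_bus_id(line) for line in text.splitlines() if line.strip()}


def missing_gpus(lspci_text, smi_text):
    """Bus IDs present on the PCI bus but absent from nvidia-smi."""
    return sorted(gpus_from_lspci(lspci_text) - gpus_from_nvidia_smi(smi_text))


if __name__ == "__main__":
    lspci_out = subprocess.run(
        ["lspci", "-D", "-d", "10de:"],
        capture_output=True, text=True, check=True,
    ).stdout
    smi_out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for bus_id in missing_gpus(lspci_out, smi_out):
        print("on PCI bus but not in nvidia-smi:", bus_id)
```

Logging the missing bus ID each time this happens would show whether it is always the same slot or card, or whether it moves around between slots and servers.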
Normally I would say it’s a bad GPU and send it back for RMA, but this happens quite often. Any suggestions?
nvidia-bug-report.log.gz (3.69 MB)