In May 2019, I bought two RTX 2080 Ti GPUs for deep learning research, running on Ubuntu Linux. I hadn’t had any problems with them until a couple of months ago, when the second GPU (“GPU 1”; the one that still works is GPU 0) started going offline whenever it’s in use (but not when it’s sitting idle).
When that happens, nvidia-smi hangs for a while and then lists only GPU 0. Once GPU 1 goes offline, it never comes back until I reboot. Then, if I start computations on GPU 1 again, it goes offline again.
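One thing that might help narrow this down is checking the kernel log for NVIDIA "Xid" error lines around the time the GPU disappears. The following is only a sketch: the regular expression assumes the driver writes lines shaped like "NVRM: Xid (PCI:<bus id>): <code>, ...", and find_xid_events is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed shape of the driver's kernel-log lines:
#   "NVRM: Xid (PCI:<bus id>): <code>, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (bus_id, xid_code) pairs for each Xid line found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]
```

The input could be the saved output of dmesg or journalctl -k. If events show up for the bus ID of GPU 1 (65:00.0 in the nvidia-smi listing), the code can be looked up in NVIDIA's Xid documentation; code 79, for example, is documented there as the GPU having fallen off the bus.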
Why is this happening? Any idea what’s going on, and how to fix it?
I could imagine that, with the side-mounted fans, there might be a thermal issue, but it’s been fine for a year until now. And it doesn’t come back online after it cools down, so I’m not convinced that’s the cause.
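A way to test the thermal theory would be to log each GPU's temperature and power draw while a job runs, then look at the last readings before GPU 1 vanishes. This is a minimal sketch, assuming nvidia-smi is on the PATH; the function names, the log file name, and the 5-second interval are illustrative choices, not anything prescribed by the driver.

```python
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = "index,temperature.gpu,power.draw,utilization.gpu"


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into {gpu_index: (temp, power, util)}.

    Values are kept as strings because nvidia-smi may report
    placeholders such as "[N/A]" instead of numbers.
    """
    readings = {}
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # skip error messages or malformed rows
        readings[int(parts[0])] = tuple(parts[1:])
    return readings


def missing_gpus(readings, expected=(0, 1)):
    """Return the expected GPU indices that are absent from this sample."""
    return [i for i in expected if i not in readings]


def monitor(interval_s=5, logfile="gpu_log.txt"):
    """Append one timestamped sample per interval until interrupted."""
    cmd = ["nvidia-smi", f"--query-gpu={FIELDS}",
           "--format=csv,noheader,nounits"]
    while True:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        readings = parse_gpu_csv(out)
        stamp = time.strftime("%H:%M:%S")
        with open(logfile, "a") as f:
            f.write(f"{stamp} {readings} missing={missing_gpus(readings)}\n")
        time.sleep(interval_s)
```

Calling monitor() in a second terminal before starting the computation would leave a record in gpu_log.txt. If the last samples for GPU 1 show modest temperatures right before it goes missing, overheating becomes a less likely explanation.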
EDIT: Here’s the current output from nvidia-smi, for those who asked. The driver package is nvidia-driver-440.
Fri Jul  3 13:29:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   32C    P8     7W / 250W |      1MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |     26MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1359      G   /usr/lib/xorg/Xorg                             9MiB |
|    1      1538      G   /usr/bin/gnome-shell                          14MiB |
+-----------------------------------------------------------------------------+
NOTE: I’d thought it might be a conflict with Xorg and gnome-shell, but that’s not it. If I disable the X server (it’s a machine I access remotely anyway), the problem persists.
(cross-posted this on GeForce Forums too.)
EDIT: Tried to upload nvidia-bug-report.log.gz but forum upload says “Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, log, doc, docx, txt, cpp, c, rtf).”