1 year in, RTX 2080 Ti (2nd of 2) has started going offline. Help?

In May 2019, I bought two RTX 2080 Ti GPUs for deep learning research, running on Ubuntu Linux. I had no problems with them until a couple of months ago, when the 2nd GPU (“GPU 1”; the one that still works is GPU 0) started going offline whenever it’s in use (but not when it’s sitting idle).

When that happens, nvidia-smi hangs for a while and then lists only GPU 0. Once GPU 1 goes offline, it never comes back online until I reboot. Then if I start computations on GPU 1, it goes offline again.

Why is this happening? Any idea what’s going on, and how to fix it?

I could imagine that, with the side-mounted fans, there might be a thermal issue, but it’s been fine for a year until now. And it doesn’t come back online after it cools down, so I’m not sure heat alone explains it.
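One way to test the thermal theory would be to log temperature and power draw for both cards until GPU 1 drops out, then look at the last readings before it vanished. Below is a minimal sketch of that idea, not a definitive tool. It assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output; the function names, the log path, and the 2-second interval are my own placeholders.

```python
import csv
import io
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["index", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_rows(text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank lines and anything that is not a data row
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus(timeout=30):
    """Run nvidia-smi once; return one dict per GPU it still reports."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return []  # nvidia-smi stalled; treat it as "no readings this round"
    return parse_gpu_rows(result.stdout)


def log_gpus(path, interval=2.0):
    """Append timestamped readings to `path` until interrupted (Ctrl-C)."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for row in query_gpus():
                writer.writerow([stamp] + [row[name] for name in FIELDS])
            handle.flush()  # keep the latest reading on disk if things hang
            time.sleep(interval)
```

Running `log_gpus("gpu_log.csv")` in a separate terminal before starting a job on GPU 1 should leave a file whose last rows for index 1 show how hot the card was, and how much power it was drawing, just before it disappeared.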

EDIT: Here’s the current output from nvidia-smi, for those who asked. The driver package is nvidia-driver-440.

Fri Jul  3 13:29:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   32C    P8     7W / 250W |      1MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |     26MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1359      G   /usr/lib/xorg/Xorg                             9MiB |
|    1      1538      G   /usr/bin/gnome-shell                          14MiB |
+-----------------------------------------------------------------------------+

NOTE: I’d thought maybe it could be a conflict with Xorg and gnome-shell, but that’s not it. If I disable the X server (it’s a machine I access remotely anyway), the problem persists.

(Cross-posted on the GeForce Forums too.)

EDIT: Tried to upload nvidia-bug-report.log.gz but forum upload says “Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, log, doc, docx, txt, cpp, c, rtf).”