First of all, I apologize for double posting. I removed the previous post immediately.
During a CUDA job, I got the following message: ‘Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU’. I rebooted the system as suggested. Unfortunately, since the reboot, nvidia-smi reports ‘No devices were found’.
I see that this is a very common problem, but some of the proposed solutions and diagnoses (such as ‘it is a hardware problem’) were given without an explicit reason, so I figured I should ask the question again.
My OS is Ubuntu 18.04 and the GPU is an RTX 2080 Ti. I connect to the system via ssh, and I do not have physical access to the machine at the moment.
I tried removing the NVIDIA drivers and reinstalling them, but the problem persists.
I am attaching the nvidia-bug-report file, which I generated after the crash, along with the latest output of ‘dmesg’. (For some reason I couldn’t find the ‘paperclip icon’ after I created the topic, so I’m including a Google Drive folder with both files: https://drive.google.com/drive/folders/1NNrCCzDAzovNXpXK5L06G6PYuJ_yQJ7b?usp=sharing )
Please let me know if I need to provide any other information.
Thanks in advance for your help!