I am facing an issue where one of the GPUs randomly disappears when idle. For a few months we didn't pay much attention: sometimes the GPU came back on its own, and otherwise we restarted the machine. But now one of them, a 1090Ti, has disappeared completely and does not appear in the GPU list even after multiple restarts.
This is a lab machine used mainly for deep learning, and I am not an expert on servers. After reading some related posts, I ran nvidia-bug-report.sh and have attached the log file here.
Can anyone help me out here?
nvidia-bug-report.log (2.5 MB)