Hello, I was training a model with PyTorch on Ubuntu 18.04 using my GeForce GTX 1080 Ti FTW3 graphics card when I suddenly lost my remote desktop session. I had to reboot the machine to get it back, but since the reboot I get the error “CUDA error: all CUDA-capable devices are busy or unavailable” when running the same code that had just been working. I didn’t make any changes to the system. Is there a way to tell whether my card has failed or whether I have some other issue with my system?
I’m trying to debug this and have just replaced my card with a GeForce GTX 1080 FTW, and I get the same error. Calling torch.cuda.is_available() returns True. Any idea why my code would stop working in the middle of training and not come back up after rebooting when no changes have been made to the system?
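Since torch.cuda.is_available() returning True does not by itself prove that a CUDA context can be created, a small probe that actually allocates a tensor on the device may help narrow this down. This is only a sketch: probe_cuda is a hypothetical helper name, and it assumes device 0 is the card in question.

```python
def probe_cuda():
    """Return a short status string describing whether CUDA is really usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "torch.cuda.is_available() is False"

    try:
        # Allocating a tensor forces PyTorch to create a CUDA context,
        # which is the step that can fail even when is_available() is True.
        torch.zeros(1, device="cuda:0")
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return f"CUDA reported as available, but using the device failed: {exc}"

    return f"CUDA works on {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(probe_cuda())
```

If the last branch is reached, the device is usable from PyTorch; if the RuntimeError branch is reached, the message should match the error seen during training.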
I’ve placed the GeForce GTX 1080 Ti FTW3 into a Windows system with PyTorch and CUDA installed and have no problem using CUDA from PyTorch there, but both cards fail in my Ubuntu system. Are there any utilities that can help identify why my cards work in one system but not the other?
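One utility that ships with the NVIDIA driver is nvidia-smi. Below is a rough sketch of querying it from Python; query_gpus is a hypothetical helper, and the assumption is that nvidia-smi is on the PATH. The compute mode is included because a GPU set to Exclusive_Process while another process holds it is one known cause of the “busy or unavailable” error, though that may not be what is happening here.

```python
import shutil
import subprocess


def query_gpus():
    """Return one CSV line per GPU, [] if nvidia-smi fails, None if it is missing."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # driver utilities are not installed or not on PATH

    result = subprocess.run(
        [exe, "--query-gpu=name,driver_version,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Typically means the tool could not talk to the NVIDIA kernel driver.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi not found")
    elif not gpus:
        print("nvidia-smi could not query the driver")
    else:
        for line in gpus:
            print(line)
```

An empty result on the Ubuntu machine, compared with a populated one on the Windows machine, would point at the driver setup rather than the card.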
Well, I got it working again by booting the system with a monitor attached to the graphics card. It now works over remote desktop too, but seemingly only if the system boots with a monitor attached. Does anyone know what modification I need to make so that I can remove the monitor from the system?
I’ve been thinking about this, and my guess is that the NVIDIA driver wasn’t selected as the default when it was installed, so with no monitor attached the system loads the Nouveau driver at boot, which breaks CUDA. This is just a guess, but I can’t think of another reasonable explanation.
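One way to check that guess is to look at which kernel module is actually loaded after a headless boot. The sketch below parses /proc/modules; the helper names are made up for illustration, and it assumes the proprietary driver's module is named nvidia and the open-source one nouveau, as on a stock Ubuntu install.

```python
from pathlib import Path


def loaded_gpu_modules(proc_modules_text):
    """Return sorted names of nouveau/nvidia modules listed in /proc/modules text."""
    names = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(n for n in names if n == "nouveau" or n.startswith("nvidia"))


def diagnose(modules):
    """Turn the module list into a one-line, human-readable verdict."""
    if "nouveau" in modules:
        return "nouveau is loaded; CUDA needs the proprietary nvidia module instead"
    if "nvidia" in modules:
        return "nvidia is loaded; the driver selection looks fine"
    return "neither nouveau nor nvidia is loaded"


if __name__ == "__main__":
    path = Path("/proc/modules")  # Linux only
    if path.exists():
        print(diagnose(loaded_gpu_modules(path.read_text())))
```

Running this once after booting with the monitor attached and once without should show whether the loaded driver really differs between the two cases.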