I started running some CUDA jobs on a machine with 10 × RTX 3090. A few hours later, when I checked on them with nvidia-smi, I only got this error output: Unable to determine the device handle for GPU 0000:1E:00.0: GPU is lost. Reboot the system to recover this GPU.
What is XID 79? Is it an error code, and what does getting it mean?
I checked the power and temperature history. There was no overheating and no power shortage. The power supply's maximum output is 8000 W, and the record shows the draw never exceeded 3000 W, so I'm sure it's not a lack of power.
Besides, after I rebooted the system and ran nvidia-smi again, it gave me a different error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
XIDs are NVIDIA error codes. XID 79 means the GPU was detached from the bus. That is not a driver issue, it is plain hardware.
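For anyone who wants to confirm which XID was actually logged, here is a minimal Python sketch that pulls XID events out of kernel log text such as the output of dmesg. The line format in the regex and the sample line are assumptions based on how the driver usually logs these events (NVRM: Xid (PCI:...): code, ...), and the name find_xid_events is only illustrative.

```python
import re

# Assumed log format:
#   "NVRM: Xid (PCI:0000:1e:00): 79, pid=..., GPU has fallen off the bus."
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# Illustrative sample line, not taken from the logs in this thread.
sample = "[12345.678] NVRM: Xid (PCI:0000:1e:00): 79, pid=4242, GPU has fallen off the bus."
print(find_xid_events(sample))
```

On a live system the input could come from dmesg or the system journal (reading either may require root); any line that does not match the assumed format is simply skipped.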
The nvidia-smi message points to the driver not being loaded or installed. Please create a new nvidia-bug-report.log.
You used the runfile installer to install the driver, but without DKMS, so the module was only compiled for the kernel that was running at that time (4.15.0-112). Now you got a kernel update to 4.15.0-137, so you have to reinstall the driver.
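To check whether a kernel update left the running kernel without an NVIDIA module, something like the following Python sketch could help. It assumes the usual Linux layout where modules live under /lib/modules/<kernel release>/ and that the driver's module files start with "nvidia"; the function name find_nvidia_modules is hypothetical.

```python
import os
import platform


def find_nvidia_modules(kernel_release=None, modules_root="/lib/modules"):
    """List nvidia*.ko* files installed for the given kernel release.

    Defaults to the running kernel (same value as `uname -r`).
    """
    release = kernel_release or platform.release()
    found = []
    for dirpath, _dirs, files in os.walk(os.path.join(modules_root, release)):
        for name in files:
            # Matches nvidia.ko, nvidia-uvm.ko, and compressed variants like .ko.xz
            if name.startswith("nvidia") and ".ko" in name:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    modules = find_nvidia_modules()
    if modules:
        print("NVIDIA modules for", platform.release(), "->", modules)
    else:
        print("No NVIDIA module found for kernel", platform.release())
```

An empty result for the running kernel, together with a non-empty result for an older release (for example 4.15.0-112), would be consistent with the situation described above.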
Only the latter, triggered by the reboot.
As I said, XID 79 is a hardware issue, so you'll have to check the hardware. It might even be the GPU failing, though that's a rare case.
Btw: every time this happens, I try to sudo reset my machine, which disconnects my SSH connection and shuts down parts of the system. But the machine does not fully turn off and seems completely frozen. So I have to hard-reset it (hold the power button for 5-10 s) every time this happens.
You're also getting an XID 79, but on a notebook and, as it seems, when the GPU is stressed. This rather points to a defective GPU. Please monitor temperatures.
Hello, I am facing the same problem.
I am running some deep learning code on an NVIDIA GeForce GTX 1080 Ti.
After running the code I can no longer see either of the two GPUs. If I run nvidia-smi, I get this error as output: “Unable to determine the device handle for GPU 0000:08:00.0: Unknown Error”.