Unable to determine the device handle for GPU

I started running some CUDA jobs on a machine with 10 * RTX 3090 GPUs. A few hours later, when I checked on them with nvidia-smi, all I got was this error output: Unable to determine the device handle for GPU 0000:1E:00.0: GPU is lost. Reboot the system to recover this GPU.

GPUs: 10 * RTX 3090
NVIDIA-SMI 455.23.05
Driver Version: 455.23.05
CUDA Version: 11.1
Max Output Power: 8000 W
nvidia-bug-report.sh log: nvidia-bug-report.log.gz (4.6 MB)

Does anyone know why the GPU is lost?

You’re getting an XID 79, fallen off the bus. The most common reasons are overheating or lack of power. Monitor temperatures and check the power connectors.
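For example, something like this lets you watch the relevant readings while the jobs run (just a sketch, assuming a reasonably recent nvidia-smi; adjust the 5-second interval as you like):

    # sample temperature, power draw and power limit every 5 seconds
    nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv -l 5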

@generix Thank you so much for replying.

Some questions below:

  1. What is XID 79? Is it an error code, and what does it mean to get one?
  2. I checked the power and temperatures against the historical records; no overheating or lack of power was found. The PSU’s maximum output power is 8000 W and the records show the draw never went over 3000 W, so I’m sure it’s not a lack of power.

Besides, after I rebooted the system and ran nvidia-smi again, it gave me another error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

Any other ideas? Could it be the driver?

Appreciate it.

XIDs are NVIDIA error codes. XID 79 means the GPU was detached from the bus; that is not a driver issue, it is plain hardware.
The nvidia-smi message points to the driver not being loaded/installed. Please create a new nvidia-bug-report.log.
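The driver also writes XID events to the kernel log, so you can usually confirm what happened there (assuming the kernel messages are still available):

    # look for NVRM Xid messages in the kernel log
    sudo dmesg | grep -i xid
    # or, with a persistent journal, also check the previous boot
    sudo journalctl -k -b -1 | grep -i xid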

@generix Thanks, I just found the documentation for XID 79; with your explanation, I figured it out.

Here is the new nvidia-bug-report.log.gz (98.2 KB)

You used the runfile installer to install the driver, but without DKMS, so the kernel module only got compiled for the kernel that was running at the time (4.15.0-112). You have since received a kernel update to 4.15.0-137, so you have to reinstall the driver.
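For reference, a reinstall that registers the module with DKMS could look roughly like this (the runfile name is only an example based on the 455.23.05 driver above; use whatever installer you actually have):

    # confirm which kernel is currently running
    uname -r
    # reinstall the driver and register it with DKMS so the module is
    # rebuilt automatically on future kernel updates
    sudo sh ./NVIDIA-Linux-x86_64-455.23.05.run --dkms
    # verify that DKMS now tracks the nvidia module
    dkms status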

Thanks @generix, I want to make this clearer.

  1. Did the kernel update cause both error messages, or only the latter?
  2. If it caused both, then either reinstalling the driver with DKMS, or reinstalling the driver and disabling kernel updates (roughly as sketched below), will solve this issue, right?
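By disabling the kernel update I mean something roughly like pinning the kernel packages, e.g. on an apt-based system (just my assumption about this machine):

    # hold back automatic kernel image/header updates (Ubuntu/Debian)
    sudo apt-mark hold linux-image-generic linux-headers-generic
    # undo later with: sudo apt-mark unhold linux-image-generic linux-headers-generic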

Only the latter, triggered by the reboot.
As said, XID 79 is hardware, so you’ll have to check. It might even be the GPU failing, though that’s a rare case.

@generix Confirmed that the latter was triggered by the reboot, and I reinstalled CUDA with the runfile installer. Thanks for your help!

I am having a similar (if not the same) issue. How did you solve it? Also, how did you generate this nvidia-bug-report to troubleshoot it?

Many thanks,

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
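Something like this, run on the affected machine, should produce the file in the current directory:

    # generate the report as root; it writes nvidia-bug-report.log.gz
    # into the current working directory
    sudo nvidia-bug-report.sh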

I just hard-reset my computer. Should I wait until the problem occurs again, or would the report still capture the log of whatever happened before rebooting?

Here is the log file.

nvidia-bug-report.log.gz (168.4 KB)

Btw: every time this happens, I try to reset the machine with sudo, which disconnects my SSH connection and shuts down parts of the system, but the machine never fully turns off and seems completely frozen. So I have to hard-reset it (hold the power button for 5-10 s) every time.

You’re also getting an XID 79, but on a notebook when the GPU is stressed, as it seems. This rather points to a defective GPU. Please monitor temperatures.
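If you want a record you can look at afterwards, you could log readings to a file while the GPU is under load, roughly like this (the field list is just a suggestion):

    # log timestamped temperature, power and SM clock every 5 seconds
    nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm --format=csv -l 5 | tee gpu_monitor.csv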

Hello, I am facing the same problem.
I am running some deep learning code on an NVIDIA GeForce GTX 1080 Ti.
After running the code I can no longer see either of the two GPUs; if I run nvidia-smi I get the error: “Unable to determine the device handle for GPU 0000:08:00.0: Unknown Error”.

The nvidia-bug-report.sh log is:
nvidia-bug-report.log.gz (318.9 KB)

Is there anyone who can help?