I started running some CUDA jobs on a machine with 10 × RTX 3090 GPUs. A few hours later, when I checked on them with nvidia-smi, I only got this error: Unable to determine the device handle for GPU 0000:1E:00.0: GPU is lost. Reboot the system to recover this GPU.
GPUs: 10 × RTX 3090
Driver Version: 455.23.05
CUDA Version: 11.1
Max Output Power: 8000 W
nvidia-bug-report.sh log: nvidia-bug-report.log.gz (4.6 MB)
Does anyone know why the GPU is lost?
You’re getting an XID 79, fallen off the bus. The most common reasons are overheating or lack of power. Monitor temperatures, check power connectors.
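A minimal way to do that monitoring (a sketch, not from this thread; the log filename is illustrative):

```shell
# Sketch: snapshot per-GPU temperature and power draw to a CSV log so a
# later "GPU is lost" event can be correlated with the thermal/power history.
logfile="gpu-monitor.csv"
query="timestamp,index,temperature.gpu,power.draw"
if command -v nvidia-smi >/dev/null 2>&1; then
    # Add "-l 60" to keep sampling every 60 seconds instead of once.
    nvidia-smi --query-gpu="$query" --format=csv >> "$logfile"
else
    echo "nvidia-smi not found; run this on the GPU host" >&2
fi
```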
@generix Thank you so much for replying.
Some questions below:
- What is XID 79? Is it an error code, and what does getting it mean?
- I checked the power and temperature history and found no overheating and no lack of power. The supply's maximum output is 8000 W, and the record shows the draw never exceeded 3000 W, so I'm sure it's not a lack of power.
Besides, after I rebooted the system and ran nvidia-smi again, it gave me a different error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
Any other ideas? Could it be the driver?
XIDs are NVIDIA error codes. XID 79 means the GPU was detached from the bus. Not a driver issue, plain hardware.
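Xid events show up in the kernel log as NVRM messages; a small sketch of pulling the Xid number out of such a line (the sample line is illustrative, not taken from your actual log):

```shell
# Illustrative kernel-log line of the kind the NVIDIA driver emits on XID 79.
sample="NVRM: Xid (PCI:0000:1e:00): 79, pid=1234, GPU has fallen off the bus."
# Extract the Xid number that follows the PCI address.
xid=$(printf '%s\n' "$sample" | sed -n 's/.*Xid ([^)]*): \([0-9]*\),.*/\1/p')
echo "detected Xid $xid"
# On a live system, check the real log instead:
#   sudo dmesg | grep -i xid
```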
The message of nvidia-smi points to the driver not being loaded/installed. Please create a new nvidia-bug-report.log.
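A quick sanity check for that situation (a sketch, assuming a standard Linux setup):

```shell
# When nvidia-smi cannot communicate with the driver, first check whether the
# nvidia kernel module is actually loaded for the running kernel.
kernel=$(uname -r)
echo "running kernel: $kernel"
if lsmod | grep -q '^nvidia'; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is NOT loaded (driver not built for this kernel?)"
fi
# With DKMS installed, "dkms status" lists which kernels the module was built for.
```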
@generix Thanks, I just found the documentation on XID 79; with your explanation, I figured it out.
Here is the new nvidia-bug-report.log.gz (98.2 KB)
You have used the runfile installer to install the driver, but without DKMS, so the module was only compiled for the kernel that was running at the time (4.15.0-112). Now you have received a kernel update to 4.15.0-137, so you have to reinstall the driver.
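A sketch of reinstalling with DKMS enabled, so the module gets rebuilt automatically on future kernel updates (the runfile name is illustrative; use the installer matching your driver version, here 455.23.05):

```shell
# Stop the display manager first if one is running.
sudo systemctl isolate multi-user.target
# Reinstall the driver with DKMS support; filename is illustrative.
sudo sh ./NVIDIA-Linux-x86_64-455.23.05.run --dkms
sudo reboot
```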
Thanks @generix, I want to make this clearer.
- Did the kernel update cause both error messages, or only the latter?
- If it caused both, then either reinstalling the driver with DKMS, or reinstalling the driver and disabling kernel updates, would solve this issue, right?
Only the latter, triggered by the reboot.
Like I said, XID 79 is hardware; you’ll have to check. It might even be the GPU failing, though that’s a rare case.
@generix Confirmed that the latter was triggered by the reboot; I reinstalled CUDA with the runfile installer. Thanks for your help!
I am having a similar (if not the same) issue. How did you solve it? Also, how did you generate this nvidia-bug-report to troubleshoot it?
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
I just hard-reset my computer. Should I wait until the problem occurs again, or would the report still capture the log of whatever happened before rebooting?
Here is the log file.
nvidia-bug-report.log.gz (168.4 KB)
By the way: every time this happens, I try to reboot the machine with sudo, which disconnects my SSH connection and shuts down parts of the system. But the machine never fully turns off and seems completely frozen, so I have to hard-reset it (hold the power button for 5-10 seconds) every time this happens.
You’re also getting an XID 79, but on a notebook and when the GPU is stressed, as it seems. That rather points to a defective GPU. Please monitor temperatures.
Hello, I am facing the same problem.
I am running some deep learning code on an NVIDIA GeForce GTX 1080 Ti.
After running the code I can no longer see either of the two GPUs; if I run nvidia-smi, I get the error: “Unable to determine the device handle for GPU 0000:08:00.0: Unknown Error”.
The nvidia-bug-report.sh log is:
nvidia-bug-report.log.gz (318.9 KB)
Is there anyone who can help?