Unable to determine the device handle for GPU

I started running some CUDA jobs on a machine with 10 * RTX3090. A few hours later, when I checked on them with nvidia-smi, the only output was this error: Unable to determine the device handle for GPU 0000:1E:00.0: GPU is lost. Reboot the system to recover this GPU.

GPUs: 10 * RTX3090
NVIDIA-SMI 455.23.05
Driver Version: 455.23.05
CUDA Version: 11.1
Max Output Power: 8000 W
nvidia-bug-report.sh log: nvidia-bug-report.log.gz (4.6 MB)

Does anyone know why the GPU is lost?

You’re getting an XID 79, GPU has fallen off the bus. The most common causes are overheating or lack of power. Monitor temperatures and check the power connectors.
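If it helps, here is a rough Python sketch of one way to log temperature and power draw per GPU while the jobs run. It assumes nvidia-smi is on the PATH and uses its CSV query output. The helper names and the 85 °C threshold are made up for illustration and are not NVIDIA limits.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def _to_float(text):
    """Convert a CSV field to float; return None for values like "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_readings(csv_text):
    """Parse output of nvidia-smi --format=csv,noheader,nounits into dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a 3-column data row
        index, temp, power = fields
        readings.append({
            "index": int(index),
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return readings


def hot_gpus(readings, limit_c=85.0):
    """Return readings above limit_c (hypothetical threshold, tune per card)."""
    return [r for r in readings if r["temp_c"] is not None and r["temp_c"] > limit_c]


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_readings(result.stdout)
```

Calling read_gpus() in a loop with a sleep and appending the result to a file gives a per-GPU history to look at after a card drops out. A single hot or power-starved card may not show up in a whole-machine power record.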

@generix Thank you so much for replying.

A few follow-up questions:

  1. What is XID 79? Is it an error code, and what does it mean?
  2. I checked the historical power and temperature records and found no overheating or power shortage. The power supply's maximum output is 8000 W, and the records show the draw never exceeded 3000 W, so I'm sure it's not a lack of power.

Also, after I rebooted the system and ran nvidia-smi again, it gave me a different error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

Any other ideas? Could it be the driver?

Appreciate it.

XIDs are NVIDIA error codes. XID 79 means the GPU was detached from the bus. That's not a driver issue, it's plain hardware.
The nvidia-smi message points to the driver not being loaded or installed. Please create a new nvidia-bug-report.log.
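The driver writes XIDs to the kernel log as "NVRM: Xid" lines, so you can check for them yourself. Below is a minimal Python sketch that pulls them out of log text. The regex is my assumption about the usual line layout (the PCI: prefix varies between driver versions) and find_xids is a made-up helper name, so adjust it to what your log actually shows.

```python
import re

# Assumed layout: "NVRM: Xid (PCI:0000:1e:00): 79, <details>"
# Some driver versions omit the "PCI:" prefix, so it is optional here.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)(?:, (?P<detail>.*))?"
)


def find_xids(log_text):
    """Return (bus_id, xid_number, detail) for every Xid line in log_text."""
    events = []
    for match in XID_PATTERN.finditer(log_text):
        detail = (match.group("detail") or "").strip()
        events.append((match.group("bus"), int(match.group("xid")), detail))
    return events
```

Feed it the output of dmesg or the dmesg section of nvidia-bug-report.log. An entry with number 79 on the bus ID from the nvidia-smi error is the "fallen off the bus" event.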

@generix Thanks, I just found the documentation for XID 79. With your explanation, I've figured it out.

Here is the new nvidia-bug-report.log.gz (98.2 KB)

You used the runfile installer to install the driver, but without DKMS, so the kernel module was only compiled for the kernel running at that time (4.15.0-112). You have since received a kernel update to 4.15.0-137, so you have to reinstall the driver.
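A quick way to see this kind of mismatch for yourself is to check which installed kernels actually have an nvidia module. The Python sketch below assumes the module is a file named nvidia.ko (possibly compressed, e.g. nvidia.ko.xz) somewhere under /lib/modules/<kernel release>. The function names are made up for illustration.

```python
import os
import platform


def kernels_with_nvidia_module(modules_root="/lib/modules"):
    """Return kernel releases under modules_root that contain an nvidia.ko* file."""
    found = []
    if not os.path.isdir(modules_root):
        return found
    for release in sorted(os.listdir(modules_root)):
        release_dir = os.path.join(modules_root, release)
        # os.walk does not follow symlinks by default, so the build/source
        # links inside each release directory are not traversed.
        for _dirpath, _dirnames, filenames in os.walk(release_dir):
            if any(n == "nvidia.ko" or n.startswith("nvidia.ko.") for n in filenames):
                found.append(release)
                break
    return found


def driver_built_for(release, modules_root="/lib/modules"):
    """True if an nvidia module exists for the given kernel release."""
    return release in kernels_with_nvidia_module(modules_root)
```

driver_built_for(platform.release()) returning False after a reboot into a new kernel would match the situation described above: the module exists only for the old kernel, so nvidia-smi cannot talk to the driver.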

Thanks @generix, I want to make sure I understand.

  1. Did the kernel update cause both error messages, or only the latter?
  2. If it caused both, then either reinstalling the driver with DKMS, or reinstalling the driver and disabling kernel updates, would solve the issue, right?

Only the latter, and it was triggered by the reboot.
As I said, XID 79 is a hardware problem, so you’ll have to check the hardware. It might even be the GPU failing, though that’s rare.

@generix Confirmed: the latter was triggered by the reboot. I reinstalled CUDA with the runfile installer. Thanks for your help!