Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

I am running a training job on a remote machine with 4 GPUs. The computation hangs after a day or two. When I checked nvidia-smi, I got the following error:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

torch.cuda.is_available() returns True
nvidia-bug-report.log.gz (1.94 MB)
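
In case it helps anyone hitting the same symptom: torch.cuda.is_available() only confirms that the driver loaded and at least one device is visible, so it can stay True even when one GPU has dropped off the bus. A minimal sketch to probe each device individually (nothing here is specific to my setup beyond using PyTorch):

```python
import torch

# torch.cuda.is_available() only checks the driver and that at least one
# device is visible; probe each device with a real operation instead.
for idx in range(torch.cuda.device_count()):
    try:
        with torch.cuda.device(idx):
            # A tiny allocation plus a synchronize forces a round trip to the GPU.
            torch.ones(1, device=f"cuda:{idx}")
            torch.cuda.synchronize(idx)
        print(f"cuda:{idx} ({torch.cuda.get_device_name(idx)}): OK")
    except RuntimeError as err:
        print(f"cuda:{idx}: FAILED -> {err}")
```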

You ran into an Xid 79 ("GPU has fallen off the bus"). The most likely causes are insufficient power supply or overheating.
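
If you want to catch the overheating or power-draw case before the GPU drops off, one option is to log temperature and power alongside the job. A rough sketch using nvidia-smi's standard CSV query mode (the 60-second interval is just a placeholder):

```python
import subprocess
import time

FIELDS = "index,temperature.gpu,power.draw,power.limit"

def snapshot():
    # nvidia-smi CSV query mode; the fields above are standard query properties.
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()

while True:
    for line in snapshot():
        idx, temp, draw, limit = (s.strip() for s in line.split(","))
        print(f"GPU {idx}: {temp} C, {draw} W / {limit} W limit")
    time.sleep(60)  # placeholder polling interval
```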

We are also facing this problem on several machines. However, it seems all the jobs continue without problems; only nvidia-smi stops working. Is there any way to restart nvidia-smi without affecting the running jobs?

If a GPU is lost (Xid 79), it means the GPU has dropped off the bus, so it's impossible for a job running on it to continue.
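
For anyone who wants to verify this on their own machines: the driver reports Xid events in the kernel log, so you can check which GPU logged the Xid 79 and when. A small sketch (assumes dmesg is readable without root; on some distros you would read journalctl -k instead):

```python
import subprocess

# The NVIDIA kernel driver logs lines like "NVRM: Xid (PCI:0000:02:00): 79, ..."
# when a GPU falls off the bus. Adjust the log source for your system if needed.
kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "Xid" in line:
        print(line)
```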