Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

I am running a training job on a remote machine with 4 GPUs. The computation hangs after a day or two. When I checked nvidia-smi, I got the following error:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

torch.cuda.is_available() returns True
nvidia-bug-report.log.gz (1.94 MB)
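
In case it helps anyone hitting the same symptom: torch.cuda.is_available() only confirms that the driver loaded and at least one device is visible, so it can stay True even when one GPU has dropped off the bus. A minimal sketch to probe each device individually (nothing here is specific to my setup beyond using PyTorch):

```python
import torch

# torch.cuda.is_available() only checks the driver and that at least one
# device is visible; probe each device with a real operation instead.
for idx in range(torch.cuda.device_count()):
    try:
        with torch.cuda.device(idx):
            # A tiny allocation plus a synchronize forces a round trip to the GPU.
            torch.ones(1, device=f"cuda:{idx}")
            torch.cuda.synchronize(idx)
        print(f"cuda:{idx} ({torch.cuda.get_device_name(idx)}): OK")
    except RuntimeError as err:
        print(f"cuda:{idx}: FAILED -> {err}")
```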

You ran into an Xid 79 ("GPU has fallen off the bus"). The most likely causes are insufficient power supply or overheating.
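
If you want to catch the overheating or power-draw case before the GPU drops off, one option is to log temperature and power alongside the job. A rough sketch using nvidia-smi's standard CSV query mode (the 60-second interval is just a placeholder):

```python
import subprocess
import time

FIELDS = "index,temperature.gpu,power.draw,power.limit"

def snapshot():
    # nvidia-smi CSV query mode; the fields above are standard query properties.
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()

while True:
    for line in snapshot():
        idx, temp, draw, limit = (s.strip() for s in line.split(","))
        print(f"GPU {idx}: {temp} C, {draw} W / {limit} W limit")
    time.sleep(60)  # placeholder polling interval
```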

We are also facing this problem on several machines. However, it seems all the jobs continue without problems; only nvidia-smi stops working. Is there any way to restart nvidia-smi without affecting the running jobs?

If a GPU is lost (Xid 79), it means the GPU has dropped off the bus, so it's impossible for a job running on it to continue.
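
For anyone who wants to verify this on their own machines: the driver reports Xid events in the kernel log, so you can check which GPU logged the Xid 79 and when. A small sketch (assumes dmesg is readable without root; on some distros you would read journalctl -k instead):

```python
import subprocess

# The NVIDIA kernel driver logs lines like "NVRM: Xid (PCI:0000:02:00): 79, ..."
# when a GPU falls off the bus. Adjust the log source for your system if needed.
kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "Xid" in line:
        print(line)
```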