GPU is lost during execution of pytorch code

When training a network with PyTorch, the execution sometimes freezes after a random amount of time (a few minutes), and running “nvidia-smi” gives this message:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

I’m working with:
CentOS Linux release 7.5.1804 (Core)
RTX 2080Ti
CUDA Version 10.0.130
PyTorch 1.0.1.post2

nvidia-bug-report.log.gz (1.44 MB) — the log file

It’s sometimes an indication of either a temperature or a power issue with the GPU: the GPU temperature may be getting too high, or the power delivery to the GPU may be inadequate.

These are just possibilities, of course, and such issues usually can’t be conclusively diagnosed in a forum like this. It usually requires trial and experimentation: for example, monitoring the temperature, trying a larger power supply, etc.
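As a starting point for the monitoring suggestion, something like the sketch below could log temperature and power draw alongside training. It wraps the real `nvidia-smi --query-gpu=temperature.gpu,power.draw` query; the 85 °C warning threshold is an assumption, not a documented limit — adjust it for your card.

```python
import subprocess

def read_gpu_stats(sample=None):
    """Return (temperature_C, power_draw_W) for GPU 0.

    `sample` lets the CSV parsing be exercised without a GPU present;
    when it is None, nvidia-smi is queried for real.
    """
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=temperature.gpu,power.draw",
             "--format=csv,noheader,nounits"],
            text=True)
    # nvidia-smi emits e.g. "84, 257.3" with the flags above
    temp_s, power_s = sample.strip().splitlines()[0].split(", ")
    return int(temp_s), float(power_s)

# Parsing a captured nvidia-smi line (no GPU needed for this example):
temp, power = read_gpu_stats("84, 257.3\n")
if temp >= 85:  # assumed threshold; pick one appropriate for your GPU
    print(f"GPU running hot: {temp} C at {power} W")
```

Running such a loop (e.g. every few seconds, or `watch -n 5 nvidia-smi`) right up until the freeze can show whether temperature or power draw spikes just before the GPU falls off the bus.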