Hi, I desperately need your help. I have tried every fix I could find online and still can't solve this problem.
I ran my PyTorch code for about 360 steps, and then this error was thrown:
unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
The nvidia-bug-report log is attached below.
However, I can still detect my first GPU (01:00.0) with lspci:
>>lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
I also have two 2080 Ti cards without NVLink. On the same setup and motherboard, they fail with the same error: GPU is lost.
Note that during the run I logged the output of
nvidia-smi -l 1 > watch.txt
to make sure that this problem is not caused by overheating.
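A CSV log would be easier to check than the plain output. Below is a minimal sketch, assuming the log were instead captured with `nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw --format=csv,noheader,nounits -l 1 > watch.csv`. The file name `watch.csv`, the helper `peak_readings`, and the sample rows are hypothetical and are not taken from my log.

```python
import csv
import io


def peak_readings(log_text):
    """Return {gpu_index: (max_temp_c, max_power_w)} from a CSV log.

    Assumes four columns per row: timestamp, index, temperature.gpu,
    power.draw (the order given in the --query-gpu list above).
    """
    peaks = {}
    for row in csv.reader(io.StringIO(log_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # ignore blank or malformed lines
        _timestamp, index, temp, power = (field.strip() for field in row)
        try:
            temp_c, power_w = float(temp), float(power)
        except ValueError:
            continue  # skip rows where a value is not numeric
        old_temp, old_power = peaks.get(index, (temp_c, power_w))
        peaks[index] = (max(old_temp, temp_c), max(old_power, power_w))
    return peaks


# Made-up sample rows, for illustration only (NOT real measurements).
sample = """\
2019/05/01 12:00:00.000, 0, 61, 180.25
2019/05/01 12:00:00.000, 1, 58, 175.00
2019/05/01 12:00:01.000, 0, 67, 243.10
2019/05/01 12:00:01.000, 1, 60, 201.75
"""

print(peak_readings(sample))
# For a real log: peak_readings(open("watch.csv").read())
```

The peak temperature and power per GPU just before the failure are the values worth comparing against the cards' limits.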
In addition, I don't think it is a PSU problem. I have a 1200 W power supply feeding the Titan Xp, the 1080 Ti, a 7700K, 4 × 8 GB of memory and, of course, the motherboard and other components such as the hard disk. Each GPU is powered by two 8-pin-to-8-pin cables, so four 8-pin-to-8-pin cables in total for the Titan Xp and the 1080 Ti (previously for the 2 × 2080 Ti).
Thank you sincerely in advance for your help. I desperately need it!
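The power budget behind that reasoning can be sketched roughly as follows. The GPU and CPU figures are the published TDP ratings, not measurements from my machine, and the 100 W allowance for memory, motherboard, and disks is only a guess. Nominal ratings also say nothing about short transient spikes, so this is a sanity check and not proof.

```python
PSU_WATTS = 1200

# Published TDP ratings; the last entry is an assumed allowance, not a spec.
nominal_watts = {
    "TITAN Xp": 250,
    "GTX 1080 Ti": 250,
    "i7-7700K": 91,
    "everything else (guess)": 100,
}

total = sum(nominal_watts.values())
headroom = PSU_WATTS - total
print(f"nominal load {total} W, headroom {headroom} W")
```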
nvidia-bug-report.log (2.93 MB)