Our server crashes while running a PyTorch model after we moved it to a new rack.
The program gets stuck after a few iterations, and when we run nvidia-smi, it reports: “Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU”.
The same code ran perfectly before we moved the server, and we changed nothing but its position, so we have no idea what is wrong.
Here’s what we have tried:
- Reboot - this recovers the GPU, but once we run any code, the “GPU is lost” problem occurs again after a while.
- Run on the other 3 GPUs - no problem.
- Check the temperature - it only goes up to 70C before the GPU is lost.
- Check the power supply - we have not changed any hardware or software on the server, we only moved it, so the power supply should be the same as before.
- Run “nvidia-smi -pm 1” to turn on persistence mode - the problem still occurs.
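One more thing we could check (a diagnostic sketch, not something from the steps above): when a GPU “is lost”, the kernel log usually contains an NVRM Xid line, and Xid 79 (“GPU has fallen off the bus”) typically points to power delivery or PCIe seating, which can change when a server is physically moved. The helper below is hypothetical and assumes the usual `dmesg` message format, which may vary by driver version:

```python
import re

# Hypothetical helper: scan kernel-log text (e.g. captured `dmesg` output)
# for NVIDIA "Xid" error lines, which usually accompany "GPU is lost".
# Xid 79 ("GPU has fallen off the bus") commonly indicates a power or
# PCIe-seating problem after a server has been physically moved.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(\S+)\): (\d+)")

def find_xid_errors(kernel_log: str):
    """Return (pci_address, xid_code) tuples found in kernel-log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(kernel_log)]

# Sample lines for illustration only (format assumed, not taken from our log):
sample = (
    "[ 1234.5] NVRM: Xid (PCI:0000:05:00): 79, pid=4321, GPU has fallen off the bus.\n"
    "[ 1234.6] NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.\n"
)
print(find_xid_errors(sample))  # → [('0000:05:00', 79)]
```

If the log shows Xid 79 for 0000:05:00.0, it would be worth reseating that card and its power cables before suspecting software.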
Here’s the bug log:
nvidia-bug-report.log (3.1 MB)
The server has four 2080 Ti GPUs.
If you know anything related to this problem, please help.