I ran into a problem on my Lenovo ThinkPad P14s Gen 3 (NVIDIA T550) running Ubuntu 20.04.
I am trying to evaluate a deep learning vision model using PyTorch and CUDA. When I launch my Python evaluation script, everything runs fine at first, but then the script freezes somewhere between 30 seconds and 2 minutes in.
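For reference, the workload looks roughly like the sketch below (the model, input shapes, and batch count are placeholders, not my actual code); the freeze happens mid-loop, after a number of batches have already completed without issue.

```python
# Rough sketch of the evaluation workload (placeholder model/data, not my real script)
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device).eval()  # stand-in for my vision model

# Stand-in for my real DataLoader: 100 random batches of 8 images, 3x224x224
batches = [torch.randn(8, 3, 224, 224) for _ in range(100)]

with torch.no_grad():
    for i, batch in enumerate(batches):
        outputs = model(batch.to(device))
        preds = outputs.argmax(dim=1)
        # After roughly 30 s to 2 min the script hangs somewhere around here,
        # and nvidia-smi starts returning "Unknown Error"
        print(f"batch {i}: {preds[:4].tolist()} ...")
```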
When running nvidia-smi, I get the following error message:
Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error
When running nvidia-debugdump --list, I get:
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
To reset everything and get the GPU running again, I have to force a reboot (power off by holding down the power button).
I ran my evaluation script numerous times while monitoring the GPU's performance, but I can't see anything alarming. While the script was running, I ran nvidia-smi -l 1 in a separate terminal to log the GPU's performance as the model was being evaluated, and copied the output into a txt file (attached below).
On top of that, I ran nvidia-bug-report.sh after the error occurred and the GPU had stopped functioning (log also attached below).
Has anyone run into a similar problem? I wasn't able to find much information about these kinds of GPU errors. This post describes a similar issue, but unfortunately I don't have the option of removing the GPU from its slot and reseating it, since it's mounted in a laptop.
nvidia-smi_-l_1.txt (19.6 KB)
nvidia-bug-report.log.gz (666.4 KB)