GPU freezes and stops responding during inference - "Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error"

I ran into a problem on my Lenovo ThinkPad P14s Gen 3 (NVIDIA T550 GPU) running Ubuntu 20.04.

I am trying to evaluate a deep learning vision model using PyTorch and CUDA. When I launch my Python evaluation script, everything runs correctly at first, but the script freezes somewhere between 30 seconds and 2 minutes after launch.
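For reference, the script is a standard batched inference loop. A minimal stand-in with the same structure and GPU usage pattern (a torchvision ResNet-50 and random tensors instead of my actual model and dataset, which I can't share here) would look like this:

```python
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for my evaluation script: same structure, but a stock torchvision
# model and random data instead of my actual model/dataset.
device = torch.device("cuda")
model = torchvision.models.resnet50(weights=None).to(device).eval()

# Fake "validation set" purely to reproduce the inference load.
dataset = TensorDataset(torch.randn(512, 3, 224, 224),
                        torch.randint(0, 1000, (512,)))
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"accuracy: {correct / total:.4f}")
```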

When I run nvidia-smi I get the following error message:

Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error

When I run nvidia-debugdump --list I get:

Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

The only way I have found to reset everything and get the GPU running again is to force a reboot (power off by holding down the power button).

I ran my evaluation script numerous times while monitoring the GPU, but I could not see anything alarming. During each run I kept nvidia-smi -l 1 going in a separate terminal to log the GPU's state as the model was being evaluated, and copied the logged output into the attached txt file.
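For completeness, a scriptable way to capture the same kind of log (instead of copy-pasting from a second terminal) would be something like the snippet below; the query fields are standard nvidia-smi --query-gpu options, and the log file name is arbitrary:

```python
import subprocess
import time

# Poll nvidia-smi once per second and append the readings to a CSV log,
# roughly equivalent to watching `nvidia-smi -l 1` in a second terminal.
FIELDS = "timestamp,temperature.gpu,utilization.gpu,memory.used,power.draw"

with open("gpu_log.csv", "w") as log:
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        # Once the driver loses the device, nvidia-smi starts failing;
        # keep the error text so the moment of failure is visible in the log.
        log.write(out.stdout if out.returncode == 0 else out.stderr)
        log.flush()
        time.sleep(1)
```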

On top of that, I ran nvidia-bug-report.sh after the error occurred and the GPU had stopped responding.

Has anyone run into a similar problem? I wasn't able to find much information about this kind of GPU error. This post describes a similar issue, but unfortunately I don't have the option of removing the GPU from its slot and reseating it, since it's built into a laptop.

nvidia-smi_-l_1.txt (19.6 KB)
nvidia-bug-report.log.gz (666.4 KB)