Hi
I am running an AI model with TensorFlow Serving in Docker on a PC with an NVIDIA GeForce RTX 3060.
Recently I noticed that the display freezes and nvidia-smi can no longer reach the GPU. This happened after I installed Ubuntu 22.04 from scratch along with the recommended driver.
This is the error I get:
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
After a reboot it works again. Sometimes it fails after a couple of hours, sometimes after longer. The PC is not always busy running the models when it crashes; sometimes it happens during the night.
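Since the failures are intermittent and can happen while the machine is idle, it may help to record the GPU state at regular intervals so the last good reading before a freeze is on disk. The following is a minimal sketch, not a tested tool: the log file name `gpu_health.log`, the 60-second interval, and the function names are assumptions chosen for illustration. It relies only on the `--query-gpu` and `--format` options of nvidia-smi.

```python
import os
import subprocess
import time

# Fields to record; all are standard nvidia-smi --query-gpu properties.
QUERY = "timestamp,temperature.gpu,power.draw,pstate"

def sample_gpu(run=subprocess.run):
    """Query nvidia-smi once; return its CSV line, or an ERROR marker on failure."""
    try:
        result = run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # nvidia-smi missing, or hanging because the GPU no longer responds.
        return f"ERROR {exc}"
    if result.returncode != 0:
        # For example the "Unable to determine the device handle" message.
        return f"ERROR {(result.stdout + result.stderr).strip()}"
    return result.stdout.strip()

def log_forever(path="gpu_health.log", interval_s=60):
    """Append one sample per interval (hypothetical path and interval)."""
    with open(path, "a") as log:
        while True:
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {sample_gpu()}\n")
            # Flush and fsync so the latest line is more likely to survive a freeze.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a background session would leave a timeline of temperature, power draw, and performance state. If the last readings before the first `ERROR` line show normal temperatures, a thermal cause becomes less likely.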
I installed the recommended NVIDIA driver 550, but the problem remained. I have also tried driver 535.
I have also tried a fresh install of Ubuntu 24.04, again with no luck.
Is this a driver problem? Or a problem with PCIe Bus?
Xid 79 suggests that it can be one of the following problems, but that does not narrow it down much:
Hardware error | Driver Error | System Memory Corruption | Bus Error | Thermal Issue
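To see whether Xid 79 is the only code reported, or whether other Xid codes precede it, the kernel log can be scanned for all Xid messages. The sketch below assumes the common `NVRM: Xid (PCI:<bus>): <code>, ...` layout of these lines; the text after the code varies by driver version, so only the bus and the code are extracted. The function name `find_xid_events` is a hypothetical helper.

```python
import re

# Matches the start of an NVIDIA Xid kernel message, e.g.
# "NVRM: Xid (PCI:0000:01:00): 79, ...". Trailing text is ignored.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+)")

def find_xid_events(lines):
    """Return (line_number, pci_bus, xid_code) for every Xid message in lines."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = XID_RE.search(line)
        if match:
            events.append((number, match.group("bus"), int(match.group("code"))))
    return events

def scan_log(path):
    """Print every Xid event found in the log file at path."""
    # errors="replace" avoids crashing on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as handle:
        for number, bus, code in find_xid_events(handle):
            print(f"line {number}: Xid {code} on PCI {bus}")
```

For example, `scan_log("/var/log/syslog")` (the path is an assumption and depends on the distribution's logging setup) lists each event with its line number, which makes it easier to line the failures up against load, time of day, or other kernel messages nearby.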
System Details:
OS: Debian ~22.04.1-Ubuntu SMP x86_64 GNU/Linux
Kernel: 6.8.0-49-generic x86_64
GPU: NVIDIA GeForce RTX 3060
Nvidia driver: 550.120
CUDA version: 12.4
nvidia-debugdump -l
Found 1 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3060 (*PrimaryCard)
GPU internal ID: GPU-1f874bcf-52ec-e7c4-6447-7af1ca832cdc
Attached are the syslog from around the time the error occurred and the nvidia-bug-report.
syslog.log (1.7 KB)
nvidia-bug-report.log.gz (182.2 KB)