Unable to determine the device handle for GPU0000:3E:00.0: Unknown Error

Hello, we have a server with the following characteristics:

  • OS: Ubuntu 24.04.1 LTS
  • GPUs: 10 × NVIDIA GeForce RTX 2080 Ti
  • CUDA version: 12.6
  • Driver Version: 560.35.03

Everything works fine for a few days, but then the GPUs start having problems and nvidia-smi reports the error "Unable to determine the device handle for GPU0000:3E:00.0: Unknown Error". We reinstalled the drivers and rebooted the server, but after a few days the same error appeared again.

Here is the log from the second time the error appeared.
nvidia-bug.log.gz (4.7 MB)

Do you have any idea what could be the cause?

It might be overheating, a defective fan, or a defective GPU. Please start by monitoring the GPU temperatures.
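
If it helps, here is a rough Python sketch of one way to do that monitoring: it polls nvidia-smi once a minute and appends temperature and fan speed per GPU to a file, so you can look at the readings just before the error shows up. The 85 °C threshold, the 60 s interval, the log file name gpu_temps.csv, and the function names are all my own assumptions, not NVIDIA recommendations, so adjust them to your setup.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
QUERY_FIELDS = ["timestamp", "index", "pci.bus_id", "temperature.gpu", "fan.speed"]
TEMP_LIMIT_C = 85  # assumed warning threshold, not an official limit


def parse_rows(csv_text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # skip blank lines and error text
        rows.append(dict(zip(QUERY_FIELDS, (v.strip() for v in rec))))
    return rows


def hot_gpus(rows, limit=TEMP_LIMIT_C):
    """Return the indices of GPUs at or above the temperature limit."""
    hot = []
    for row in rows:
        try:
            if int(row["temperature.gpu"]) >= limit:
                hot.append(row["index"])
        except ValueError:
            pass  # non-numeric reading such as "[N/A]"
    return hot


def poll_once():
    """Run nvidia-smi once; return (exit code, stdout, stderr)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return -1, "", "nvidia-smi timed out"
    return result.returncode, result.stdout, result.stderr


def monitor(interval_s=60, log_path="gpu_temps.csv"):
    """Append readings to log_path until interrupted (Ctrl+C)."""
    while True:
        code, out, err = poll_once()
        with open(log_path, "a") as log:
            log.write(out)
            if code != 0:
                # Record the failure so the time of the first error is visible.
                log.write(f"# nvidia-smi exit code {code}: {err.strip()} {out.strip()}\n")
        hot = hot_gpus(parse_rows(out))
        if hot:
            print("GPUs at or above", TEMP_LIMIT_C, "C:", ", ".join(hot))
        time.sleep(interval_s)
```

Call monitor() from a screen or tmux session and leave it running until the error comes back; the last lines in the log before the failure should show whether GPU 3E:00.0 was running hot or its fan had stalled.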