Hello, we have a server with the following characteristics:
- OS: Ubuntu 24.04.1 LTS
- GPUs: 10 × NVIDIA GeForce RTX 2080 Ti
- CUDA version: 12.6
- Driver Version: 560.35.03
Everything works fine for a few days, but then the GPUs start to have problems and nvidia-smi
reports the error "Unable to determine the device handle for GPU0000:3E:00.0: Unknown Error".
We reinstalled the drivers and rebooted the server, but after a few days the same error appeared again.
Here is the log from the second time the error appeared.
nvidia-bug.log.gz (4.7 MB)
Do you have any idea what could be the cause?