OS: Ubuntu 20.04.6 LTS
Driver Version: 470.256.02
CUDA: 11.4.3
GPUs: 3x NVIDIA GeForce RTX 3080
Hi,
I have three NVIDIA GeForce RTX 3080 cards running in a headless server that I use for simulations.
From time to time, the graphics cards become unreachable.
When that happens, nvidia-smi gives this error:
Unable to determine the device handle for GPU 0000:42:00.0: Unknown Error
I cannot determine the cause of this error.
nvidia-debugdump --list gives:
Found 3 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3080
GPU internal ID: GPU-a44d4585-6387-1f26-31d9-62c0819d054c
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error
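Since the failure only shows up from time to time, one thing I could do is record when each GPU drops out, so the timestamps can be compared with the kernel log. Below is a minimal sketch of that idea, not a tested monitoring tool. It assumes nvidia-smi is on the PATH and that the error line has the same shape as the one quoted above. The function names (find_unreachable_gpus, check_once, log_failures) and the 60 second interval are my own placeholders.

```python
import re
import subprocess
import time
from datetime import datetime

# Assumption: the error line looks like the one quoted in this post, e.g.
# "Unable to determine the device handle for GPU 0000:42:00.0: Unknown Error"
ERROR_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
    r":\s*(.*\S)"
)


def find_unreachable_gpus(text):
    """Return a list of (pci_bus_id, reason) pairs found in nvidia-smi output."""
    return ERROR_RE.findall(text)


def check_once():
    """Run nvidia-smi once and return any unreachable GPUs it reports."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The error may be printed on either stream, so scan both.
    return find_unreachable_gpus(result.stdout + "\n" + result.stderr)


def log_failures(interval_s=60, max_checks=None):
    """Poll nvidia-smi and print a timestamped line for every failing GPU."""
    checks = 0
    while max_checks is None or checks < max_checks:
        for bus_id, reason in check_once():
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} GPU {bus_id} unreachable: {reason}")
        checks += 1
        time.sleep(interval_s)
```

Calling log_failures() in a background session would print one line per failing GPU per check. Those timestamps could then be matched against the kernel log around the same time.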
nvidia-bug-report.log.gz (772.7 KB)
Thank you for your help!