OS: Ubuntu 18.04.6 LTS
Driver Version: 460.91.03
GPUs: 4x RTX A5000
One of my GPUs suddenly crashes. I have recorded two cases in two days, and as far as I can tell it is always the same GPU. As far as I know, this GPU was not running anything unusual at the moment of the crash, but we use this machine for DL computations in general.
Yesterday I got this error:
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
Today it is:
Unable to determine the device handle for GPU 0000:21:00.0: GPU is lost. Reboot the system to recover this GPU
nvidia-debugdump --list
Found 4 NVIDIA devices
Device ID: 0
Device name: RTX A5000
GPU internal ID: 1323921072806
Error: nvmlDeviceGetHandleByIndex(): GPU is lost
FAILED to get details on GPU (0x1): GPU is lost
nvidia-bug-report.sh
nvidia-bug-report_01_17.log.gz (2.0 MB)
nvidia-bug-report_01_18.log.gz (819.9 KB)
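In case it helps anyone triaging the "GPU is lost" errors above: the NVIDIA kernel driver normally logs an `NVRM: Xid` line to the kernel log when a GPU faults, and the Xid code narrows down the cause. Below is a minimal sketch that pulls those lines out of saved kernel log text (for example the output of `dmesg`) and filters them by PCI bus ID. The function name `find_xid_events` and the sample log line are hypothetical and only for illustration; the sample is not taken from the attached bug reports, and the regular expression assumes the usual `NVRM: Xid (PCI:<bus id>): <code>, <details>` layout.

```python
import re

# Assumed layout of an Xid line in the kernel log, e.g.
#   NVRM: Xid (PCI:0000:21:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is treated as optional because older drivers omit it.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?")


def find_xid_events(log_text, pci_prefix=None):
    """Return (bus_id, xid_code, detail) tuples found in kernel log text.

    If pci_prefix is given, keep only events whose bus ID starts with it.
    """
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue
        bus_id = match.group(1)
        code = int(match.group(2))
        detail = (match.group(3) or "").strip()
        if pci_prefix and not bus_id.lower().startswith(pci_prefix.lower()):
            continue
        events.append((bus_id, code, detail))
    return events


# Hypothetical sample input, NOT copied from the real logs.
sample = (
    "[ 100.000000] usb 1-1: new high-speed USB device\n"
    "[ 200.000000] NVRM: Xid (PCI:0000:21:00): 79, pid=1234, "
    "GPU has fallen off the bus.\n"
)

for bus_id, code, detail in find_xid_events(sample, pci_prefix="0000:21:00"):
    print(f"GPU {bus_id}: Xid {code} ({detail})")
```

To run it against a real machine, the text of `dmesg` (or the kernel log section of the bug report) can be passed in place of `sample`. The bus ID prefix `0000:21:00` matches the GPU named in the error messages above, without the trailing function number.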