Unable to determine the device handle for GPU 0000:21:00.0: GPU is lost. Reboot the system to recover this GPU

OS: Ubuntu 18.04.6 LTS
Driver Version: 460.91.03
GPUs: 4x RTX A5000

I have an issue where one of my GPUs suddenly crashes. I have recorded two such crashes in two days. As far as I can tell, it is always the same GPU that crashes. As far as I know, this GPU was not running anything unusual at the moment of the crash, but we use this machine for DL computations in general.

Yesterday the error was:

Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error

Today it is:

Unable to determine the device handle for GPU 0000:21:00.0: GPU is lost. Reboot the system to recover this GPU
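
To confirm that it really is the same GPU each time, the two messages above can be parsed for the PCI bus ID and the reported reason. This is only a minimal sketch: the function name `parse_lost_gpu` is hypothetical, and the pattern is based solely on the two messages quoted in this post, so other driver versions may word them differently.

```python
import re

# Pattern derived from the two error messages quoted in this post (an assumption:
# other driver versions may format the message differently).
LOST_GPU_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"(?P<bus_id>[0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]):\s*"
    r"(?P<reason>.+)"
)


def parse_lost_gpu(line):
    """Return (bus_id, reason) for a matching error line, or None if it does not match."""
    match = LOST_GPU_RE.search(line)
    if match is None:
        return None
    return match.group("bus_id").lower(), match.group("reason").strip()


if __name__ == "__main__":
    messages = [
        "Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error",
        "Unable to determine the device handle for GPU 0000:21:00.0: GPU is lost. "
        "Reboot the system to recover this GPU",
    ]
    # Collect the distinct bus IDs; a single entry means the same GPU failed both times.
    bus_ids = {parse_lost_gpu(m)[0] for m in messages}
    print(bus_ids)
```

Run on the two messages above, both lines resolve to the same bus ID, which matches my impression that a single GPU is affected.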

nvidia-debugdump --list

Found 4 NVIDIA devices
	Device ID:              0
	Device name:            RTX A5000
	GPU internal ID:        1323921072806

Error: nvmlDeviceGetHandleByIndex(): GPU is lost
FAILED to get details on GPU (0x1): GPU is lost

nvidia-bug-report.sh

nvidia-bug-report_01_17.log.gz (2.0 MB)
nvidia-bug-report_01_18.log.gz (819.9 KB)

You’re getting a lot of PCIe bus errors until the bus fails completely. Are you using risers to connect the GPU?
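
If you want to see this pattern yourself, a rough sketch like the one below can tally the relevant lines in a saved kernel log (for example, the output of `dmesg` redirected to a file). The function name `summarize_kernel_log` is hypothetical, and the two patterns assume the usual kernel wording for AER reports (`PCIe Bus Error: severity=...`) and NVIDIA driver Xid reports (`NVRM: Xid (PCI:...): N`); the exact text in your logs may differ.

```python
import re
from collections import Counter

# Assumed line formats: the kernel's AER reports and the NVIDIA driver's Xid reports.
AER_RE = re.compile(r"PCIe Bus Error: severity=(?P<severity>[^,]+)")
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus_id>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")


def summarize_kernel_log(lines):
    """Count AER bus errors by severity and Xid events by (bus ID, Xid code)."""
    aer_counts = Counter()
    xid_counts = Counter()
    for line in lines:
        aer = AER_RE.search(line)
        if aer:
            aer_counts[aer.group("severity").strip()] += 1
        xid = XID_RE.search(line)
        if xid:
            xid_counts[(xid.group("bus_id"), int(xid.group("xid")))] += 1
    return aer_counts, xid_counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py saved_dmesg.txt  (the file name is a placeholder)
    with open(sys.argv[1], errors="replace") as handle:
        aer_counts, xid_counts = summarize_kernel_log(handle)
    print("AER errors by severity:", dict(aer_counts))
    print("Xid events by (bus ID, code):", dict(xid_counts))
```

A long run of bus errors followed by an Xid event on the same device would be consistent with the failure described here, but the counts alone do not identify the faulty component.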

I am not sure.

I am renting this server, so I don’t know its physical configuration and I don’t have physical access to the machine.

I can check with my hardware provider and come back later if this information is crucial.

If it’s a rented server, you should complain to your provider: PCIe bus errors are hardware errors, and anything required to fix them has to be done on the physical end.

Okay, thank you very much.