After training a machine learning model for at least 2 days, nvidia-smi displays "Unable to determine the device handle for GPU 0000:85:00.0: unknown error."
The logs display “GPU has fallen off the bus”
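For reference, this is roughly how I pull the NVIDIA driver messages out of the kernel log (a minimal sketch; it assumes journalctl is available, and the pattern just matches the usual NVRM Xid format):

```python
# Minimal sketch: list NVIDIA driver (NVRM) Xid events from the kernel log.
# Assumes systemd's journalctl is available; on other setups the same lines
# show up in dmesg or /var/log/syslog instead.
import re
import subprocess

def find_xid_events():
    log = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    # Xid lines look like:
    # NVRM: Xid (PCI:0000:85:00): 79, ... GPU has fallen off the bus.
    return [line for line in log.splitlines() if re.search(r"NVRM: Xid", line)]

if __name__ == "__main__":
    for event in find_xid_events():
        print(event)
```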
I have tried many suggestions found on Google, but none of them helped. I would really appreciate any help!
Here is my nvidia-bug-report:
nvidia-bug-report.log.gz (3.2 MB)
One GPU shut down, Xid 79. This is likely due to overheating.
The GPU that fell off the bus has stayed consistently below 70 °C, so it shouldn't be due to overheating. I'm currently checking the power supply.
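In case it matters, this is roughly how I've been logging the core temperatures (a minimal sketch using the nvidia-ml-py package, imported as pynvml; the 60-second interval and plain-print output are arbitrary choices, and it only reads the core temperature, not memory):

```python
# Minimal sketch: periodically log each GPU's core temperature via NVML.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import time
import pynvml

def log_temperatures(interval_s=60):
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                print(f"GPU {i}: {temp} C")
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    log_temperatures()
```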
After checking the power supply and restarting, the GPU worked for a few hours and then went offline again.
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: A GPU crash dump has been created. If possible, please run
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: the NVIDIA kernel module is unloaded.
This is the new error.
Did you swap power cables with another GPU to check for faulty connectors?
If it's still happening, this might be a faulty GPU, bad solder joints, or poor video memory cooling. Unfortunately, memory temperatures can't be read on Linux.
Yes, I moved the GPU to a different slot, but the card still malfunctions. It is likely a problem with the GPU itself.