After training a machine learning model for at least 2 days, nvidia-smi displays "Unable to determine the device handle for GPU 0000:85:00.0: unknown error."
The logs display “GPU has fallen off the bus”
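For reference, this is roughly how I pull the NVIDIA driver messages out of the kernel log (a minimal sketch; it assumes journalctl is available, and the pattern just matches the usual NVRM Xid format):

```python
# Minimal sketch: list NVIDIA driver (NVRM) Xid events from the kernel log.
# Assumes systemd's journalctl is available; on other setups the same lines
# show up in dmesg or /var/log/syslog instead.
import re
import subprocess

def find_xid_events():
    log = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=False,
    ).stdout
    # Xid lines look like:
    # NVRM: Xid (PCI:0000:85:00): 79, ... GPU has fallen off the bus.
    return [line for line in log.splitlines() if re.search(r"NVRM: Xid", line)]

if __name__ == "__main__":
    for event in find_xid_events():
        print(event)
```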
I have tried many suggestions found on Google, but none of them helped. I would really appreciate any help!
Here is my nvidia-bug-report:
nvidia-bug-report.log.gz (3.2 MB)
One GPU shut down, Xid 79. This is likely due to overheating.
The GPU that fell off the bus has stayed consistently below 70 °C, so it shouldn't be due to overheating. I'm currently checking the power supply.
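In case it matters, this is roughly how I've been logging the core temperatures (a minimal sketch using the nvidia-ml-py package, imported as pynvml; the 60-second interval and plain-print output are arbitrary choices, and it only reads the core temperature, not memory):

```python
# Minimal sketch: periodically log each GPU's core temperature via NVML.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import time
import pynvml

def log_temperatures(interval_s=60):
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                print(f"GPU {i}: {temp} C")
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    log_temperatures()
```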
After checking the power supply and restarting, the GPU worked for a few hours and then went offline again.
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: A GPU crash dump has been created. If possible, please run
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 9 08:41:53 pvmed-190 kernel: [53128.517420] NVRM: the NVIDIA kernel module is unloaded.
This is the new error.
Did you swap power cables with another GPU to check for faulty connectors?
If it's still happening, this might be a faulty GPU, bad solder joints, or poor video memory cooling. Unfortunately, memory temperatures can't be read on Linux.
Yes, I moved the GPU to a different slot, but the card still malfunctions. It is likely a problem with the GPU itself.