Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

Every time I train a machine-learning model for a while, nvidia-smi shows "Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error".
The logs also show "Failed detecting connected display devices" and "Failed to grab modeset ownership".
Here is my nvidia-bug-report:
nvidia-bug-report 2.log (471.3 KB)
I have found many ways on Google, but they are useless. I would appreciate it very much if someone could help me!

04 17:13:12 612-server kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.

Xid 79 ("GPU has fallen off the bus") usually points to insufficient power or overheating.


My GPU temperature stays at about 85 °C. Is that overheating? Also, how can I rule out insufficient power? My power supply is 850 W, which should be more than enough.

To check for PSU issues, you can limit the clocks to avoid power spikes from GPU Boost:
nvidia-smi -lgc 300,1500
Apart from the GPU temperature, there is also the memory temperature, which unfortunately cannot be read on Linux. So while 85 °C is not great, it is acceptable for the GPU core; the memory might still be overheating.
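To see whether power draw or temperature climbs right before the failure, you can also log both while training runs. A minimal sketch (the query fields are standard nvidia-smi options; the log file name is just an example):

```shell
# Log timestamp, GPU temperature, power draw and SM clock every 5 seconds,
# appending to a file so the last readings before a crash are preserved.
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
           --format=csv -l 5 >> gpu_log.csv

# When you are done testing, reset the locked clocks back to default:
# nvidia-smi -rgc
```

If power.draw spikes near the PSU's rail limit, or the clocks suddenly drop to zero in the last lines of the log, that supports the power/thermal theory.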


Thank you very much for your reply. I don't have this problem when training models with a smaller VRAM footprint. Only when a model that uses most of the VRAM (about 20,000 MB of 24,576 MB) trains for more than 2 hours does the GPU fall off the bus.

Really sounds like a memory thermal issue. You might want to use gpu-burn and cuda-gpumemtest to check for general faults.
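For stress-testing, gpu-burn takes a run duration in seconds. A sketch assuming you build it from the wilicc/gpu-burn source with the CUDA toolkit installed:

```shell
# Clone and build gpu-burn (requires nvcc / the CUDA toolkit on PATH).
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make

# Run a one-hour stress test; -d uses double precision for higher load.
# Watch temperatures in a second terminal with `nvidia-smi -l 5`.
./gpu_burn -d 3600
```

If the card falls off the bus under gpu-burn long before 2 hours, that reproduces your training failure in a controlled way.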


I replaced the power cable and set nvidia-drm.modeset=1, intel_idle.max_cstate=1, intel_pstate=enable, and pcie_aspm=off. The GPU has now been running for 12 hours without falling off the bus. I haven't done an ablation to find out which change actually mattered, but for now these settings seem effective. Thank you very much for your suggestions!
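For anyone following along: kernel parameters like the ones above are typically set in /etc/default/grub and applied with update-grub on Debian/Ubuntu-style systems. A config sketch reproducing the parameters listed in this post (adjust for your distro, and note this is the poster's combination, not a verified fix):

```shell
# /etc/default/grub -- append the parameters to the existing line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia-drm.modeset=1 intel_idle.max_cstate=1 intel_pstate=enable pcie_aspm=off"

# then regenerate the grub config and reboot:
# sudo update-grub && sudo reboot
```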

Unfortunately, it dropped again after training for about 24 hours.