Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

spadexiao6 · September 4, 2022, 10:10am

Each time I train a machine learning model for a while, nvidia-smi reports "Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error".
The logs also show "Failed detecting connected display devices" and "Failed to grab modeset ownership".
Here is my nvidia-bug-report
nvidia-bug-report 2.log (471.3 KB)
I have tried many fixes I found through Google, but none of them worked. I would appreciate it very much if someone could help me!

generix · September 5, 2022, 8:25am

04 17:13:12 612-server kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.

Possible causes: insufficient power or overheating.
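
For anyone trying to find the same event in their own logs, here is a minimal sketch that pulls the PCI address and Xid code out of kernel log lines shaped like the one quoted above. The helper name `extract_xid` is made up for illustration, and the pattern only assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in that line.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<PCI address> <Xid code>" for every NVRM Xid line found.
extract_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Typical usage on a live system (reading the kernel log may need root):
#   journalctl -k --no-pager | extract_xid
#   dmesg | extract_xid
```

Fed the log line from this thread, it should print the address `0000:01:00` followed by code `79`, the "GPU has fallen off the bus" event.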

spadexiao6 · September 6, 2022, 5:31am

My GPU temperature stays at about 85°C. Is that overheating? Also, how can I fix the insufficient power problem? My power supply is 850 W, which should be more than enough.

generix · September 6, 2022, 7:05am

To check for PSU issues, you can limit the clocks to avoid power spikes caused by GPU boost:
nvidia-smi -lgc 300,1500
Apart from the GPU temperature, there's also the memory temperature, which unfortunately can't be read on Linux. So while 85°C isn't great but is OK for the GPU, the memory might still be overheating.
https://forums.developer.nvidia.com/t/request-gpu-memory-junction-temperature-via-nvidia-smi-or-nvml-api/168346
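
A rough sketch of how the clock-limit test above could be combined with logging, so there is a record of GPU temperature and power draw leading up to a drop. The function names `log_gpu_stats` and `peak_stats` and the file name `gpu_stats.csv` are placeholders, not anything from this thread. The query fields are standard `nvidia-smi --query-gpu` fields, and, as noted above, memory temperature is not among what can be read here.

```shell
# Hypothetical helper: append one CSV row (timestamp, GPU temp in C,
# power draw in W, SM clock in MHz) every $1 seconds to file $2.
log_gpu_stats() {
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
        --format=csv,noheader,nounits -l "${1:-5}" >> "${2:-gpu_stats.csv}"
}

# Hypothetical helper: read that CSV on stdin and print the highest
# GPU temperature and power draw seen.
peak_stats() {
    awk -F', *' 'NF >= 3 { if ($2 + 0 > t) t = $2 + 0
                           if ($3 + 0 > p) p = $3 + 0 }
                 END { printf "max_temp_c=%d max_power_w=%.1f\n", t, p }'
}

# Possible workflow (clock changes need root):
#   sudo nvidia-smi -lgc 300,1500      # lock GPU clocks as suggested above
#   log_gpu_stats 5 gpu_stats.csv &    # log in the background
#   ... run the training job ...
#   peak_stats < gpu_stats.csv         # inspect peaks after a drop
#   sudo nvidia-smi -rgc               # reset the clock lock
```

If the card still drops with clocks locked and the logged power draw well below the limit, that would point away from power spikes, though it cannot rule out a memory thermal problem.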

spadexiao6 · September 6, 2022, 8:11am

Thank you very much for your reply. I don't have this problem when training models that use less VRAM. The "GPU has fallen off the bus" error only appears after a model that uses a lot of VRAM (20000 MB of 24576 MB) has been training for more than 2 hours.

generix · September 6, 2022, 9:15am

Really sounds like a memory thermal issue. You might want to use gpu-burn and cuda-gpumemtest to check for general faults.

spadexiao6 · September 7, 2022, 4:07am

I replaced the power cable with a new one and set nvidia-drm.modeset=1, intel_idle.max_cstate=1, intel_pstate=enable and pcie_aspm=off. The graphics card has now been running for 12 hours without falling off the bus. I haven't done an ablation experiment to see which change matters, but so far these settings seem to be effective. Thank you very much for your suggestions!
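
As a sanity check that the kernel parameters listed above are actually active after a reboot, something like the following sketch could be used. The function name `missing_params` is hypothetical. On GRUB-based distributions such parameters are usually added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and applied by regenerating the GRUB config, but that is an assumption about the setup, since the post does not say how they were set.

```shell
# Hypothetical helper: print every requested parameter that does not
# appear as a whole word in the kernel command line string given as $1.
missing_params() {
    cmdline="$1"
    shift
    for p in "$@"; do
        case " $cmdline " in
            *" $p "*) ;;          # present, nothing to report
            *) echo "$p" ;;       # absent from the command line
        esac
    done
}

# Usage on the running system:
#   missing_params "$(cat /proc/cmdline)" \
#       nvidia-drm.modeset=1 intel_idle.max_cstate=1 \
#       intel_pstate=enable pcie_aspm=off
```

No output means all four parameters were picked up at boot.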

spadexiao6 · September 8, 2022, 9:00am

Unfortunately, the GPU fell off the bus again after about 24 hours of training.
