How do I completely solve "NVRM: GPU 0000:01:00.0: GPU has fallen off the bus"?

I was training a DL model on the GPU and it got stuck after about 20 minutes; when the error occurs I also cannot access the GPU through nvidia-smi. After rebooting, I can access the GPU with nvidia-smi without any error, but when I run the training program the problem happens again after about 20 minutes of training. I have used the same program to train DL models for many hours without errors before, so this is annoying and weird.

Possible root causes and remedies for the error:

  1. Overheating
  2. Insufficient/unstable power supply
  3. Reseating the card or moving it to a different PCIe slot
  4. System BIOS updates

I monitored the temperatures during training, and they were always below 50 °C, so it is not overheating. I also tried enabling persistence mode, but that did not help. Finally, I used the solution from "Unable to determine the device handle for GPU xxxxxxxx: Unknown Error" and the command below to temporarily work around the error; the training program then ran stably for 30 minutes before the error happened again. How can I fix the root cause for good?

Temporarily worked around the error with:

nvidia-smi -lgc 300,1500
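
For reference, a minimal sketch of standard nvidia-smi commands for the temperature monitoring and persistence-mode steps mentioned above (the exact flags here are illustrative, not copied from my session; only the -lgc line above is the actual workaround I ran):

nvidia-smi -pm 1                      # enable persistence mode (needs root)
nvidia-smi dmon -s pct                # continuously log power/temperature, clocks and PCIe throughput while training
nvidia-smi -q -d TEMPERATURE,POWER    # one-shot dump of thermal and power readings
nvidia-smi -rgc                       # reset the locked graphics clocks back to default afterwards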

Some useful information:

  • GPU and driver:
    • driver: 515.76
    • GPU: NVIDIA GeForce RTX 3060
    • CUDA: 11.7
    • TensorFlow: 2.10.0
  • dmesg -T
    [Sun Oct 16 11:45:40 2022] NVRM: GPU at PCI:0000:01:00: GPU-903cc954-07f3-f490-e3d4-7e79bffaa22f
    [Sun Oct 16 11:45:40 2022] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
    [Sun Oct 16 11:45:40 2022] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
    [Sun Oct 16 11:45:40 2022] NVRM: A GPU crash dump has been created. If possible, please run
                               NVRM: nvidia-bug-report.sh as root to collect this data before
                               NVRM: the NVIDIA kernel module is unloaded.
    
  • nvidia-debugdump --list
    Error: nvmlDeviceGetHandleByIndex(): Unknown Error
    FAILED to get details on GPU (0x0): Unknown Error
    
  • nvidia-smi
    Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
    
  • bug report
    nvidia-bug-report.log.gz (127.3 KB)

I have the exact same error with a GTX 970:

[ 1799.393003] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[ 1799.442780] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 1799.442781] pcieport 0000:00:01.0:   device [8086:460d] error status/mask=00002001/00002000
[ 1799.442782] pcieport 0000:00:01.0:    [ 0] RxErr                 

Your PCIe bus is breaking down. If you are using a riser, please remove it. Another possible reason is the PCIe chipset on the mainboard overheating; please check whether it is actively cooled, the fan is working, and there is no dust. You can try to work around it by lowering the PCIe generation in the BIOS, if that option is available.
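
If you want to confirm the diagnosis before changing anything in the BIOS, here is a minimal sketch of the checks (it assumes the GPU sits at 0000:01:00.0 as in the logs above; the pcie_aspm=off kernel parameter is a common workaround for marginal links, not a guaranteed fix):

sudo lspci -vv -s 01:00.0 | grep -iE 'lnkcap|lnksta'   # compare the negotiated link speed/width (LnkSta) with the card's capability (LnkCap)
sudo dmesg -T | grep -iE 'aer|pcie bus error'          # check whether corrected/uncorrected PCIe errors keep accumulating while training
# If the BIOS offers no PCIe-generation setting, disabling ASPM sometimes stabilizes a flaky link:
# add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub
# (Debian/Ubuntu; other distros use grub2-mkconfig) and reboot.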