Quadro P5000 Mobile GPU has fallen off the bus

I have embedded hardware with a Quadro P5000 Mobile.

> lspci -s 4:0.0 -v
04:00.0 VGA compatible controller: NVIDIA Corporation GP104GLM [Quadro P5000 Mobile] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, windrvr1430, nvidia

I have a CUDA-based application that has run on many other GPUs without issue. When I run it on this hardware with any reasonable workload, I get the following messages in the system log.

[  216.262842] nvidia 0000:04:00.0: irq 554 for MSI/MSI-X
[  217.082854] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[  956.400788] NVRM: GPU at PCI:0000:04:00: GPU-f245f23d-16fa-2da0-d3b2-4f93986bc9ba
[  956.408342] NVRM: Xid (PCI:0000:04:00): 79, pid=3121, GPU has fallen off the bus.
[  956.415895] NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
[  956.421902] NVRM: GPU 0000:04:00.0: GPU serial number is .
[  956.427454] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

Via gpumon, I know that the temperature never exceeds 50 °C, and that the GPU clock is ramping up to at or near its peak rate (1.9 GHz). If I lower the power limit (via nvidia-smi) from 60 W to 50 W, the error becomes less frequent. Any suggestions on what I could try, or how I could debug this issue further?
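For reference, the power-limit experiment can be reproduced with commands like these (the device index `-i 0` and the 50 W value are specific to my setup; adjust as needed):

```shell
# Keep the driver loaded between runs so the limit sticks
# (nvidia-persistenced is the non-deprecated way to do this)
sudo nvidia-smi -i 0 -pm 1

# Lower the board power limit from the default 60 W to 50 W
sudo nvidia-smi -i 0 -pl 50

# Sample power draw, clocks, and temperature every 100 ms while the
# workload runs, to see how close the board gets to the limit
nvidia-smi -i 0 --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu \
           --format=csv -lms 100
```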

Below are details on the driver load.

[   43.062740] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[   43.071426] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[   43.155233] i40e 0000:05:00.0 p2p1: changing MTU from 1500 to 9216
[   43.186544] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.129.06  Thu May 12 22:52:02 UTC 2022
[   43.218614] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.219608] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.237069] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.129.06  Thu May 12 22:42:45 UTC 2022
[   43.249801] nvidia-uvm: Loaded the UVM driver, major device number 233.
[   43.255303] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.265749] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.265992] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[   43.265994] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0

This might be an insufficient power supply. Lowering the power limit is a poor workaround, since that regulation is very slow and doesn’t prevent power spikes.
Since this is an embedded system, though, it might also point to defective hardware; it depends on how the GPU is managed. On notebooks, for example, the dGPU is managed by the system BIOS, so on a power issue the whole notebook would shut down rather than the dGPU falling off the bus.
So you should rather contact the manufacturer of the system about it.

I’ve been able to run a GPU burn application on the board, which draws a constant 80 W while running, and that test has not had an issue. My suspicion is that the failure is related to instantaneous power spikes, but I’m not sure how to test and verify the power supply under those conditions. Any ideas on how I could isolate a power supply issue versus an embedded card issue?
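One cheap first step is to log `power.draw` at nvidia-smi’s fastest sampling rate during the burn test, e.g. `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -lms 50 > power.csv`, then scan the log for excursions above the power limit. A minimal sketch of the scan (the 60 W limit and the fabricated readings are assumptions; note that nvidia-smi’s sampled telemetry will still miss sub-millisecond spikes, which really need a scope on the supply rail):

```python
def scan_power_log(lines, limit_w=60.0):
    """Parse power.draw samples (one float per line, in watts) and
    report the peak, the mean, and how many samples exceed limit_w."""
    samples = [float(line) for line in lines if line.strip()]
    peak = max(samples)
    mean = sum(samples) / len(samples)
    n_over = sum(1 for w in samples if w > limit_w)
    return {"peak": peak, "mean": mean, "n_over_limit": n_over}

# Example with fabricated readings; in practice read them from power.csv:
readings = ["48.1", "52.7", "79.5", "50.2", "61.3"]
print(scan_power_log(readings, limit_w=60.0))
```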

I don’t know, since this is a Pascal-based GPU. Turing and later have the nvidia-smi -lgc option to lock clocks and check for that. Also, as said, since this is an embedded system, only its manufacturer can really tell what’s going on.
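For anyone reading this later with a Turing-or-newer card, the clock lock mentioned above looks like this (the 1000 MHz value is just an example):

```shell
# Pin the graphics clock to 1000 MHz (Turing and later)
sudo nvidia-smi -i 0 -lgc 1000,1000

# ...run the workload and watch whether the failure still occurs...

# Reset the graphics clock to its default behaviour
sudo nvidia-smi -i 0 -rgc
```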