Quadro P5000 Mobile GPU has fallen off the bus

I have embedded hardware with a Quadro P5000 Mobile.

> lspci -s 4:0.0 -v
04:00.0 VGA compatible controller: NVIDIA Corporation GP104GLM [Quadro P5000 Mobile] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, windrvr1430, nvidia

I have a CUDA-based application that has run on many other GPUs without issue. When I run it on this hardware with any reasonable workload, I get the following messages in the system log.

[  216.262842] nvidia 0000:04:00.0: irq 554 for MSI/MSI-X
[  217.082854] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[  956.400788] NVRM: GPU at PCI:0000:04:00: GPU-f245f23d-16fa-2da0-d3b2-4f93986bc9ba
[  956.408342] NVRM: Xid (PCI:0000:04:00): 79, pid=3121, GPU has fallen off the bus.
[  956.415895] NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
[  956.421902] NVRM: GPU 0000:04:00.0: GPU serial number is .
[  956.427454] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

Via gpumon, I know that the temperature never exceeds 50 °C, and that the GPU clock is ramping up to at or near its peak rate (1.9 GHz). If I lower the power limit (via nvidia-smi) from 60 W to 50 W, the error becomes less frequent. Any suggestions on what I could try, or how I could debug this issue further?
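For reference, the power-limit experiment can be reproduced with commands like these (the device index `-i 0` and the 50 W value are specific to my setup; adjust as needed):

```shell
# Keep the driver loaded between runs so the limit sticks
# (nvidia-persistenced is the non-deprecated way to do this)
sudo nvidia-smi -i 0 -pm 1

# Lower the board power limit from the default 60 W to 50 W
sudo nvidia-smi -i 0 -pl 50

# Sample power draw, clocks, and temperature every 100 ms while the
# workload runs, to see how close the board gets to the limit
nvidia-smi -i 0 --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu \
           --format=csv -lms 100
```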

Below are details on the driver load.

[   43.062740] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[   43.071426] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[   43.155233] i40e 0000:05:00.0 p2p1: changing MTU from 1500 to 9216
[   43.186544] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.129.06  Thu May 12 22:52:02 UTC 2022
[   43.218614] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.219608] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.237069] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.129.06  Thu May 12 22:42:45 UTC 2022
[   43.249801] nvidia-uvm: Loaded the UVM driver, major device number 233.
[   43.255303] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.265749] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[   43.265992] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[   43.265994] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0

This might be an insufficient power supply. Lowering the power limit is a poor workaround, since that regulation is very slow and doesn’t prevent power spikes.
Since this is an embedded system, though, it might also point to defective hardware; it depends on how the GPU is managed. On notebooks, for example, the dGPU is managed by the system BIOS, so on a power issue the whole notebook would shut down rather than the dGPU falling off the bus.
So you should rather contact the manufacturer of the system about it.

I’ve been able to run a GPU burn application on the board, which draws a constant 80 W while running, and that test has not had an issue. My suspicion is that the failure is related to instantaneous power spikes, but I’m not sure how to test and verify the power supply under those conditions. Any ideas on how I could isolate a power supply issue versus an embedded card issue?
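One cheap first step is to log `power.draw` at nvidia-smi’s fastest sampling rate during the burn test, e.g. `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -lms 50 > power.csv`, then scan the log for excursions above the power limit. A minimal sketch of the scan (the 60 W limit and the fabricated readings are assumptions; note that nvidia-smi’s sampled telemetry will still miss sub-millisecond spikes, which really need a scope on the supply rail):

```python
def scan_power_log(lines, limit_w=60.0):
    """Parse power.draw samples (one float per line, in watts) and
    report the peak, the mean, and how many samples exceed limit_w."""
    samples = [float(line) for line in lines if line.strip()]
    peak = max(samples)
    mean = sum(samples) / len(samples)
    n_over = sum(1 for w in samples if w > limit_w)
    return {"peak": peak, "mean": mean, "n_over_limit": n_over}

# Example with fabricated readings; in practice read them from power.csv:
readings = ["48.1", "52.7", "79.5", "50.2", "61.3"]
print(scan_power_log(readings, limit_w=60.0))
```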

I don’t know, since this is a Pascal-based GPU. Turing and later have the nvidia-smi -lgc option to lock clocks and check for that. Also, as said, since this is an embedded system, only its manufacturer can really tell what’s going on.
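For anyone reading this later with a Turing-or-newer card, the clock lock mentioned above looks like this (the 1000 MHz value is just an example):

```shell
# Pin the graphics clock to 1000 MHz (Turing and later)
sudo nvidia-smi -i 0 -lgc 1000,1000

# ...run the workload and watch whether the failure still occurs...

# Reset the graphics clock to its default behaviour
sudo nvidia-smi -i 0 -rgc
```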