I have embedded hardware with a Quadro P5000 Mobile.
> lspci -s 4:0.0 -v
04:00.0 VGA compatible controller: NVIDIA Corporation GP104GLM [Quadro P5000 Mobile] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, windrvr1430, nvidia
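A note on the output above: `rev ff`, `prog-if ff` and `Unknown header type 7f` are what lspci usually prints when reads of the device's config space return all ones. That would suggest this capture was taken after the GPU had already dropped off the bus, though this is an inference from the output. A minimal sketch for checking it directly, assuming the standard Linux sysfs path for this device; the helper names `config_space_is_dead` and `read_config_header` are made up for illustration:

```python
from pathlib import Path


def config_space_is_dead(header: bytes) -> bool:
    """Return True if every byte read from PCI config space is 0xFF."""
    return len(header) > 0 and all(b == 0xFF for b in header)


def read_config_header(bdf: str = "0000:04:00.0", length: int = 64) -> bytes:
    """Read the first `length` bytes of a device's PCI config space.

    Assumes the usual sysfs layout on Linux; the default address is the
    GPU from the lspci output above.
    """
    path = Path("/sys/bus/pci/devices") / bdf / "config"
    with path.open("rb") as f:
        return f.read(length)
```

Calling `config_space_is_dead(read_config_header())` once before the workload starts and again after the Xid appears would show whether the device stops responding on the bus at that point.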
I have a CUDA-based application that has run on many other GPUs without issue. When I run it on this hardware with any reasonable workload, I get the following messages in the system log.
[ 216.262842] nvidia 0000:04:00.0: irq 554 for MSI/MSI-X
[ 217.082854] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[ 956.400788] NVRM: GPU at PCI:0000:04:00: GPU-f245f23d-16fa-2da0-d3b2-4f93986bc9ba
[ 956.408342] NVRM: Xid (PCI:0000:04:00): 79, pid=3121, GPU has fallen off the bus.
[ 956.415895] NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
[ 956.421902] NVRM: GPU 0000:04:00.0: GPU serial number is .
[ 956.427454] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
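To track how long each run survives before failing, the Xid line in the log above can be pulled out programmatically. This is only a sketch: the regular expression matches the exact format shown in this log and may not cover other driver versions, and the names `XidEvent` and `parse_xid` are illustrative.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    timestamp: float  # seconds since boot, from the kernel log prefix
    bus: str          # PCI address as printed by the driver
    xid: int          # Xid error code
    pid: int          # process ID reported by the driver
    message: str      # trailing description text


# Matches lines of the form seen above:
# [ 956.408342] NVRM: Xid (PCI:0000:04:00): 79, pid=3121, GPU has fallen off the bus.
_XID_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \(PCI:([^)]+)\): (\d+), pid=(\d+), (.*)"
)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an Xid log line, or None for any other line."""
    m = _XID_RE.search(line)
    if m is None:
        return None
    ts, bus, xid, pid, message = m.groups()
    return XidEvent(float(ts), bus, int(xid), int(pid), message.strip())
```

Feeding each line of `dmesg` output through `parse_xid` and keeping the non-`None` results gives the time from boot to failure, which can then be compared between runs at different power limits.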
Via gpumon, I know that the temperature does not exceed 50 °C and that the GPU clock ramps up to, or close to, its peak rate (1.9 GHz). If I set the power limit (via nvidia-smi) to 50 W rather than 60 W, the error becomes less likely. Any suggestions on what I could try, or how I could debug this issue further?
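The temperature, clock and power observations above could also be logged with timestamps, so that the last samples before the Xid 79 can be lined up against the kernel log. A rough sketch using `nvidia-smi --query-gpu` in loop mode follows. The field list and the helper names (`build_command`, `parse_sample`, `log_samples`) are my own choices, and some fields may come back as `[N/A]` on a mobile part, which the parser maps to `None`.

```python
import subprocess
from typing import Dict, List, Optional, Union

# Fields requested from nvidia-smi; the first one is kept as a string.
FIELDS = ["timestamp", "temperature.gpu", "clocks.sm", "power.draw", "power.limit"]


def build_command(interval_s: int = 1) -> List[str]:
    """Build an nvidia-smi command that prints one CSV sample per interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]


def parse_sample(line: str) -> Dict[str, Union[str, Optional[float]]]:
    """Parse one CSV line; values that are not numbers become None."""
    values = [v.strip() for v in line.split(",")]
    sample: Dict[str, Union[str, Optional[float]]] = {"timestamp": values[0]}
    for name, raw in zip(FIELDS[1:], values[1:]):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None  # e.g. "[N/A]" for an unsupported field
    return sample


def log_samples(interval_s: int = 1) -> None:
    """Print samples until nvidia-smi exits (it may hang or fail once the GPU is gone)."""
    proc = subprocess.Popen(build_command(interval_s), stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    for line in proc.stdout:
        if line.strip():
            # Flush each sample so the last one survives a crash.
            print(parse_sample(line), flush=True)
```

Running `log_samples()` in a second terminal while the workload runs, with output redirected to a file, would record whether power draw or SM clock spikes immediately before the failure.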
Below are the kernel log messages from the driver load.
[ 43.062740] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 43.071426] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 43.155233] i40e 0000:05:00.0 p2p1: changing MTU from 1500 to 9216
[ 43.186544] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.129.06 Thu May 12 22:52:02 UTC 2022
[ 43.218614] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[ 43.219608] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[ 43.237069] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.129.06 Thu May 12 22:42:45 UTC 2022
[ 43.249801] nvidia-uvm: Loaded the UVM driver, major device number 233.
[ 43.255303] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[ 43.265749] Request for unknown module key 'VTS: 2c7d1e69445ce9f243f095be9883f9ea86ab5943' err -11
[ 43.265992] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 43.265994] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0