Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: A GPU crash dump has been created. If possible, please run NVRM: nvidia-bug-report.sh as root to collect this data before NVRM: the NVIDIA kernel module is unloaded.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=797, GPU has fallen off the bus.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-0aa5ef0d-2a02-ee18-f14e-bbd9ebf50562
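For anyone who wants to find the same event in their own logs, here is a minimal sketch that pulls the PCI address and Xid code out of NVRM kernel messages. The function name extract_xid is hypothetical, and the pattern is only assumed to match the "NVRM: Xid (PCI:...): NN," layout seen in the lines above; other driver versions may format the message differently.

```shell
# Read kernel log lines on stdin and print "<pci-address> <xid-code>"
# for every NVRM Xid message. extract_xid is a hypothetical helper name;
# the pattern assumes the message layout shown in the log above.
extract_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}

# Usage, assuming a systemd journal (-k limits output to kernel messages):
#   journalctl -k --no-pager | extract_xid
```

Fed the Xid line from the log above, this should print the PCI address followed by 79.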
I have now experienced this crash three times since September; twice today.
Distro: Arch Linux
Driver: 455.23.04 (DKMS)
GPU: EVGA NVIDIA GTX 960 4GB
It seemingly occurs out of the blue. Earlier, I was on a Zoom call when it crashed with that error. Later, it crashed 5 hours into a model fit [this did not use the GPU, but the CPU was at 100% on all cores].
Things I have noted:
- I am not near RAM capacity.
- My temperatures are fine (`sensors | grep -i Core` for the CPU and `nvidia-smi` for the GPU showed ~80 °C and 50 °C respectively).
- When stress testing with `stress -c 6` for the CPU and `gpu_burn 360` for the GPU, no issues occurred. Temperatures were, again, around 80 °C for the CPU and 77 °C for the GPU.
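Since the readings above were taken by hand, one way to capture the GPU temperature right up to a crash is to log it periodically. This is only a sketch under assumptions: log_gpu_temp and the log path are hypothetical names, and it relies on the nvidia-smi query option temperature.gpu being available in the installed driver.

```shell
# Append one timestamped GPU temperature sample to the log file given as $1.
# log_gpu_temp is a hypothetical helper; the log path is up to the caller.
log_gpu_temp() {
    # If nvidia-smi is missing or fails (for example after the GPU has
    # fallen off the bus), record that instead of a temperature.
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader 2>/dev/null) || temp="unavailable"
    printf '%s gpu_temp_c=%s\n' "$(date '+%F %T')" "$temp" >> "$1"
    sync  # flush to disk so the last sample is more likely to survive a hard freeze
}

# Usage: sample every 5 seconds until interrupted.
#   while :; do log_gpu_temp "$HOME/gpu-temp.log"; sleep 5; done
```

The last few lines of the log can then be compared against the timestamp of the Xid message after a reboot.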
After the second crash, I reseated my GPU and its power plugs, and I am trying another kernel (5.8.14.zen1). I will see whether the crash occurs again after the reseating. There was also a new nvidia update, so I updated to nvidia 455.28.
I am attaching the log below.
nvidia-bug-report.log.gz (826.9 KB)
Are there any steps I can or should take to diagnose this problem should it occur again?