Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: Xid (PCI:0000:01:00): 79, pid=797, GPU has fallen off the bus.
Oct 08 20:23:04 hwkiller-desktop kernel: NVRM: GPU at PCI:0000:01:00: GPU-0aa5ef0d-2a02-ee18-f14e-bbd9ebf50562
Hi all,
I have now experienced this crash three times since September; twice today.
Distro: Arch Linux
Kernel: linux-ck-skylake
Driver: 455.23.04 (DKMS)
GPU: EVGA NVIDIA GTX 960 4GB
It seems to occur out of the blue. The first time today, I was on a Zoom call when it crashed with the error above. The second time, it was 5 hours into a model fit (this did not use the GPU, but the CPU was at 100% on all cores).
Things I have noted:
- I am not near RAM capacity.
- My temperatures are fine (per `sensors | grep -i Core` for the CPU and `nvidia-smi` for the GPU, they were ~80°C and 50°C respectively).
- When stress testing with `stress -c 6` for the CPU and `gpu_burn 360` for the GPU, no issues occurred. Temperatures were, again, around 80°C for the CPU and 77°C for the GPU.
After it happened the second time, I reseated my GPU and its power plugs, and I am now trying another kernel (5.8.14.zen1). I will see whether the crash occurs again after the reseating. There was also a new NVIDIA driver release, so I updated to 455.28.
I am attaching the log below.
nvidia-bug-report.log.gz (826.9 KB)
Are there any steps I can or should take to diagnose this problem should it occur again?
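For reference, here is a minimal sketch of how the Xid lines could be pulled out of the kernel log after a reboot, so the codes can be compared across crashes. The function name `xid_codes` is my own and purely illustrative. It assumes the log lines have the same shape as the `NVRM: Xid (PCI:...): 79, ...` line quoted at the top of this post.

```shell
#!/bin/sh
# xid_codes (hypothetical helper): read kernel log lines on stdin and
# print only the numeric Xid code from each "NVRM: Xid (...): N, ..." line.
# Lines that do not match that shape are silently dropped.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}
```

Assuming the system uses the systemd journal, something like `journalctl -k -b -1 | xid_codes | sort | uniq -c` should then count the Xid codes logged during the previous boot.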