Full system freeze when GTX 970 is under load

Over the past few weeks I’ve been getting random full system freezes (can’t SSH in, can’t switch TTYs, the last few seconds of audio loop) when my GTX 970 is under load. My main GPU, which drives the displays, is an AMD RX 580; the GTX 970 is only used for CUDA and NVENC and otherwise sits idle. The freezes have occurred either while streaming (OBS encoding 1920×1080 60 FPS video with NVENC) or while training a neural network with CUDA (100% GPU utilization in nvidia-smi). They started on Fedora 35; I’m now on Fedora 36 with driver version 510.54 and they still happen. I use GNOME on Wayland (which all goes through the RX 580). There’s nothing useful in journalctl either: just normal messages, and then the log ends at the point where the freeze occurred.

What leads me to believe the freezes are caused by NVIDIA is that they only happen when the card is under load, as I mentioned. It usually takes less than 30 minutes of neural network training to trigger a freeze like that.

Here is the nvidia-bug-report.sh output, collected after a reboot since I can’t SSH in once the freeze occurs: nvidia-bug-report.log.gz (313.9 KB)

Looks like a driver version mix-up:

Feb 27 11:49:16 autumnblaze kernel: NVRM: API mismatch: the client has the version 510.47.03, but
NVRM: this kernel module has the version 510.54. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

How did you install the drivers? Tried dracut -f to clean up your initrd?
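To see at a glance which two versions the driver is complaining about, the NVRM message can be parsed with a small shell function. This is only a sketch: the name nvrm_mismatch_versions is made up for illustration, and it assumes the message wording matches the excerpt above.

```shell
# Hypothetical helper (not part of any NVIDIA tool): reads kernel log text on
# stdin and prints the client and kernel-module versions named in the first
# NVRM "API mismatch" message. Assumes the wording shown in the log excerpt.
nvrm_mismatch_versions() {
    log=$(cat)
    # Version reported by the userspace client (library).
    client=$(printf '%s\n' "$log" |
        sed -nE 's/.*client has the version ([0-9]+(\.[0-9]+)*).*/\1/p' | head -n 1)
    # Version of the kernel module that is loaded.
    module=$(printf '%s\n' "$log" |
        sed -nE 's/.*kernel module has the version ([0-9]+(\.[0-9]+)*).*/\1/p' | head -n 1)
    if [ -z "$client" ] || [ -z "$module" ]; then
        echo "no API mismatch found"
        return 0
    fi
    echo "client=$client module=$module"
}
```

On the affected machine it could be fed the kernel messages of the current boot with `journalctl -k -b | nvrm_mismatch_versions`. Whatever still reports the older client version is presumably the component that was not updated along with the kernel module.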

How did you install the drivers?

rpm-ostree install kernel-devel akmod-nvidia xorg-x11-drv-nvidia-cuda

Tried dracut -f to clean up your initrd?

Since this is Silverblue, the initrd is rebuilt from a clean state on every update. Maybe there’s some issue with the NVIDIA package for Fedora branched? Checking rpm -qa | grep nvidia, all packages have the same version.
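One way to double-check this is to print only the version of each NVIDIA package and count how many distinct values appear. A minimal sketch, assuming rpm's standard --qf query tags; the function name distinct_versions is hypothetical:

```shell
# Hypothetical helper: reads "name version" pairs on stdin and prints each
# distinct version once, so a single output line means all packages match.
distinct_versions() {
    awk 'NF >= 2 { print $2 }' | sort -u
}

# Usage on the affected machine (not run here):
#   rpm -qa --qf '%{NAME} %{VERSION}\n' | grep -i nvidia | distinct_versions
#   cat /proc/driver/nvidia/version    # version of the loaded kernel module
```

If this prints more than one line, or the single version differs from what /proc/driver/nvidia/version reports, some component is still out of step with the kernel module.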

Anyhow, the freezes happened even before I upgraded to F36, so I’m not sure this is the underlying issue.

It’s idling at 46°C with the fan at 52%. That doesn’t look right. Have you already tried monitoring the temperatures under load and cleaning the fans and heat spreader?
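Since the machine locks up hard, one option is to log the temperature to disk while the card is under load and read the last entries after the reboot. A rough sketch, assuming nvidia-smi's CSV query output; the 80°C limit and the function name flag_hot are arbitrary choices for illustration, not NVIDIA recommendations:

```shell
# Hypothetical filter: reads "temperature, fan, utilization" CSV lines (no
# header, no units) on stdin and prints only lines at or above the limit in °C.
flag_hot() {
    limit=${1:-80}   # assumed threshold, adjust for your card
    awk -F', *' -v limit="$limit" '$1 + 0 >= limit + 0 { print "HOT: " $0 }'
}

# Usage on the affected machine (not run here): sample every 5 seconds and
# append to a file that can be inspected after the freeze.
#   nvidia-smi --query-gpu=temperature.gpu,fan.speed,utilization.gpu \
#       --format=csv,noheader,nounits -l 5 >> gpu-temps.csv
#   flag_hot 80 < gpu-temps.csv
```

The last few samples before a freeze may be missing if they were not yet flushed to disk, so treat the end of the file as approximate.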