Frequent X11 crash with 470 driver on RTX 2060 at 5120x1440, kernel 5.13

Hello,

I am getting very frequent, random crashes on Ubuntu 21.10. Sometimes it takes half an hour, sometimes 3 seconds after boot. There is no specific trigger; it can be any application or action.
I also tried the 460 and 450 drivers with no difference. At 3840x1080 the system is stable. The monitor is an AOC AG493UCX.

greetings Jaap

dmesg.log (131.8 KB)
nvidia-bug-report.log.gz (112.2 KB)
nvidia-nvml-temp13971.log (2.9 KB)

This looks a bit odd; it seems to be triggered by Ubuntu’s gpu-manager. Please try creating /etc/X11/xorg.conf containing just

Section "Device"
    Identifier     "nvidia"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:1:0:0"
EndSection

and set kernel parameter
nogpumanager
Then please check if the crash still occurs after reboot.
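
For anyone unsure how to set a kernel parameter on Ubuntu, here is a minimal sketch. It assumes the stock GRUB setup, where /etc/default/grub contains a GRUB_CMDLINE_LINUX_DEFAULT="..." line, and it relies on GNU sed. The helper name add_kernel_param is made up for illustration.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM   (hypothetical helper, not part of any package)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT line in FILE,
# unless the parameter is already present on that line.
add_kernel_param() {
    file=$1
    param=$2

    # Skip the edit if PARAM already appears as a separate word.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi

    # Insert PARAM just before the closing quote; keep a backup as FILE.bak.
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}
```

With the function defined in a root shell, add_kernel_param /etc/default/grub nogpumanager followed by update-grub and a reboot should apply the parameter. The original file is kept as /etc/default/grub.bak.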

Unfortunately there is no improvement. I checked the gpu-manager log and confirmed it was disabled; I am not sure whether xorg.conf was loaded, since I had to create it.
I created a fresh nvidia-bug-report, but it was empty.
I also wish to withdraw my statement that the system is stable at lower resolution: I got 2 lockups at 3840x1080 too.
Sorry, I have no further information at this point. I will try to create a decent bug report with the nogpumanager kernel switch.

Thanks for helping out so far.
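
One way to tell whether xorg.conf was picked up is to look in the Xorg log, which prints a "Using config file" line when it reads one. A small sketch; the function name xorg_config_line is made up, and the log path is an assumption (commonly /var/log/Xorg.0.log, or ~/.local/share/xorg/Xorg.0.log when X runs as the logged-in user).

```shell
#!/bin/sh
# xorg_config_line LOGFILE   (hypothetical helper)
# Prints the first "Using config file" line from an Xorg log.
# Returns non-zero if the log has no such line, i.e. no xorg.conf was read.
xorg_config_line() {
    grep -m1 'Using config file' "$1"
}
```

For example, xorg_config_line /var/log/Xorg.0.log should print a line naming /etc/X11/xorg.conf if the file was used.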

You could also just run
sudo journalctl -b-1 |grep kernel >kernel.txt
after reboot to get a dmesg from the previous boot.
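
To narrow that output down to GPU driver messages, a filter along these lines may help. It is only a sketch: the function name gpu_lines is made up, and the patterns assume the usual prefixes (NVRM and Xid for the nvidia driver, nouveau for Nouveau), so unrelated lines can slip through.

```shell
#!/bin/sh
# gpu_lines   (hypothetical helper)
# Reads log lines on stdin and keeps those that mention the nvidia
# kernel driver (NVRM, Xid) or nouveau, ignoring case.
gpu_lines() {
    grep -Ei 'NVRM|Xid|nouveau'
}
```

For example: sudo journalctl -k -b -1 | gpu_lines > gpu.txt, where -k limits the journal to kernel messages and -b -1 selects the previous boot.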

This bug report is heading in a different direction. To get some work done I switched to the Nouveau driver, and that one also crashes. With Nouveau it is not a complete hang, since I can still reboot over an SSH connection; with the nvidia driver I cannot, but still do…

The last freeze with the nvidia driver and nogpumanager did not write anything to dmesg. I went through a couple of freezes and found no trace of an error in the logs and nothing in the nvidia bug report either, but irq/193-nvidia is using 100% CPU in top.

Attached the dmesg outputs from both the nvidia and the nouveau incidents.

dmesg.txt (98,9 KB)
dmesg-nouveau.txt (262,7 KB)

This rather looks like a general GPU hardware failure. Please check your GPU using gpu-burn.

I am having some issues compiling gpu-burn due to a glibc/CUDA version mismatch. I will post the outcome as soon as I have test results. Thanks for helping out so far!

I found a similar error report with the same hardware, an Intel NUC11PHi7:
https://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg5960877.html

Did you already check for a BIOS upgrade? While I don’t really think it will help, try setting the kernel parameter
intel_idle.max_cstate=1

I am running the latest BIOS version and currently using intel_idle.max_cstate=1 and nogpumanager as kernel options. The system has been stable for ~40 hours so far and is still running with the 470 driver…
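
To confirm that both options are really active on the running kernel, /proc/cmdline can be checked. A minimal sketch; the function name has_kernel_param is made up for illustration, and the optional second argument exists only so the check can be pointed at a different file.

```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE_FILE]   (hypothetical helper)
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}

    # Put each parameter on its own line, then require an exact,
    # literal match so "quiet" does not match "quiet2", for example.
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}
```

For example: has_kernel_param nogpumanager && has_kernel_param intel_idle.max_cstate=1 && echo "both set". The effective C-state limit should also be readable from /sys/module/intel_idle/parameters/max_cstate when the intel_idle driver is in use.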