I’ve recently been facing quite a few issues with my NVIDIA GeForce RTX 2070 Super with Max-Q Design on Linux kernel 5.15.0-89-generic.
It started when I somehow upgraded my kernel to 5.15.0-89-generic while still on Ubuntu 20.04. My system would frequently freeze and was slow to shut down or start up. Before the kernel update I was using driver version 470 with toolkit 11.4, and everything was working as expected.
After more or less isolating the issue to the graphics card (e.g. no problems noticed when fully switching to the internal Intel GPU with prime-select), I decided to upgrade Ubuntu altogether.
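For reference, the switching test was roughly the following (assuming Ubuntu's nvidia-prime package; a reboot, or at least re-login, is needed after switching):

    # check which GPU mode is currently active
    prime-select query

    # switch fully to the integrated Intel GPU, then reboot and test
    sudo prime-select intel
    sudo reboot

    # later, switch back to the NVIDIA GPU (or to on-demand / offload mode)
    sudo prime-select nvidia    # or: sudo prime-select on-demand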
On 22.04 I’ve been running driver version 525 & toolkit 12.0. This seems to have fixed my issues to a degree. While my system hasn’t completely frozen yet (instead it freezes for a few milliseconds now & then), it is still slow to start or shut down (sometimes I have to shut down by going into a tty, and at other times only Alt + SysRq (PrtSc) + REISUB will help), and sometimes the GPU even crashes altogether (meaning that the NVIDIA driver/toolkit crashes).
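(Side note, in case someone wants to try the same: REISUB only works if the Magic SysRq key is enabled. I checked/enabled it roughly like this, where 1 enables all SysRq functions:)

    # check which SysRq functions are currently allowed (Ubuntu's default is a bitmask, not 1)
    cat /proc/sys/kernel/sysrq

    # temporarily allow all of them so Alt + SysRq + R,E,I,S,U,B works
    echo 1 | sudo tee /proc/sys/kernel/sysrq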
I also experimented with kernel 6.2.0-37-generic, but this was even worse. First of all, only the latest driver (545) & toolkit (12.3) would work at all. But then, any time I gave the GPU any serious workload, it would simply crash. Basically not usable at all.
It’s important to note that I also have an external RTX 3090 GPU, which, however, hasn’t exhibited any of the issues the 2070 has.
So I guess what I’m looking for is some sort of confirmation that the above reasoning makes sense and that it indeed could be a driver/toolkit issue with the RTX 2070. Are there other driver/toolkit combinations I could still try for more stability on Ubuntu 22.04 with kernel 5.15?
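To show what I’m choosing between, this is roughly how I’ve been listing the driver series available on 22.04 (the 535 install line below is just an example, not a recommendation):

    # drivers Ubuntu recommends for the detected GPUs
    ubuntu-drivers devices

    # all nvidia-driver-* packages available from the enabled repositories
    apt-cache search ^nvidia-driver-

    # installing a specific series, e.g. 535 (example only)
    sudo apt install nvidia-driver-535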
I’m also a bit worried that this could be a physical issue with the GPU itself. Are there any tests I could run for that? If needed, I can provide nvidia-bug-report.sh logs.
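For what it’s worth, these are the kinds of basic checks I can run and share the output of (just nvidia-smi queries and a PCIe sanity check, nothing exhaustive):

    # overall GPU state, utilisation and driver version
    nvidia-smi

    # temperature, power and clock details for GPU 0
    nvidia-smi -q -d TEMPERATURE,POWER,CLOCK -i 0

    # confirm the GPU still shows up on the PCIe bus after a crash
    lspci | grep -i nvidia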
Yeah, true. I’ve noticed that one for quite some time now, when the system boots. Anything you’d recommend for that?
Perhaps I’ll re-share the bug report when the said issues recur. I’m not sure how far back the logs in this report are collected, or whether they somehow got cleared.
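In the meantime I’ll check how far back the kernel logs on this machine actually go, something along these lines:

    # list the boots journald still has logs for
    journalctl --list-boots

    # kernel messages from the previous boot, filtered to NVIDIA-related lines
    journalctl -k -b -1 | grep -iE 'nvrm|nvidia'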
For example, one of the errors I kept a note of was: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040
This happened when the driver crashed (saying it could not find the GPU, or something like that) during a workload, and I rebooted the laptop.
This issue is only present when nvidia.ko is loaded, so it is not obvious that Intel is the culprit. The Intel driver is just displaying more debugging info; it might very well be just detecting some misbehaviour from nvidia.ko.
Maybe, but if the NVIDIA driver can take down the Intel driver while the NVIDIA GPU is in offload mode and sleeping, this points to some issue with the Intel driver, with the NVIDIA driver being the trigger rather than the root cause. Since the OP was on a specific kernel version when this started, the usual way to find the issue would be to do a kernel bisect.
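A minimal sketch of what such a bisect could look like, assuming a last-good and first-bad upstream kernel can be identified (the version tags below are placeholders; with Ubuntu's patched kernels you would first narrow things down using mainline builds):

    # clone the stable kernel tree (placeholder tags below, adjust to the real good/bad kernels)
    git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    cd linux

    git bisect start
    git bisect bad  v5.16    # placeholder: first kernel showing the freezes
    git bisect good v5.15    # placeholder: last kernel known to work

    # at each step: build, install and boot the bisected kernel, test the workload,
    # then report the result until git names the offending commit
    git bisect good    # or: git bisect bad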