GeForce RTX 2070 not working properly with Ubuntu 22.04 on kernel 5.15

I’ve recently run into quite a few issues with my NVIDIA GeForce RTX 2070 Super with Max-Q Design on Linux kernel 5.15.0-89-generic.

It started when I somehow upgraded my kernel to 5.15.0-89-generic while still on Ubuntu 20.04. My system would frequently freeze and be slow to shut down or start. Before the kernel update, I was using driver version 470 with toolkit 11.4, and everything was working as expected.

After more or less isolating the issue to the graphics card (e.g. no problems noticed when fully switching to the internal Intel GPU with prime-select), I decided to upgrade Ubuntu altogether.

On 22.04 I’ve been running driver version 525 and toolkit 12.0. This seems to have fixed my issues to a degree. While my system hasn’t completely frozen yet (instead it freezes for a few milliseconds now and then), it is still slow to start or shut down (sometimes I have to shut down from a tty, and at other times only Alt + PrtSc + REISUB helps), and sometimes the GPU even crashes altogether (meaning that the NVIDIA driver/toolkit crashes).

I also experimented with kernel 6.2.0-37-generic but this was even worse. First of all, only the latest drivers (545) & toolkit (12.3) would work at all. But then anytime I gave any serious workload to the GPU it would simply crash. Basically not usable at all.

It is important to note that I also have an external RTX 3090 GPU, which, however, hasn’t exhibited any of the issues that the 2070 has.

So I guess what I’m looking for is some sort of confirmation that the above reasoning makes sense, and that it could indeed be a driver/toolkit issue with the RTX 2070. Are there other driver/toolkit combos I could still try for more stability on Ubuntu 22.04 with kernel 5.15?

I’m also a bit worried that this could be a physical issue with the GPU itself. Any tests I could run for that? If needed can provide nvidia-bug-report.sh logs.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thanks for the response. Attached it.

nvidia-bug-report.log.gz (1.7 MB)

Actually, I can only see recurring issues with the intel igpu:

[ 8257.412209] [drm:lspcon_init [i915]] *ERROR* Failed to probe lspcon
[ 8257.412285] [drm:lspcon_resume [i915]] *ERROR* LSPCON init failed on port D

Yeah, true. I’ve noticed that one for quite some time now, when the system boots. Anything you’d recommend for that?

Perhaps I should re-share the bug report when the said issues recur. I’m not sure how far back the logs in this report go, or whether they somehow got cleared.

E.g. one of the errors that I kept note of was: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040

This happened when the driver crashed during the workload (saying it cannot find the GPU, or something like that), and I rebooted the laptop.
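To see at a glance which driver is logging errors, the kernel log lines can be tallied by driver tag. This is only a minimal sketch, assuming the log has been saved as text first (for example the output of `dmesg`). The function name `count_driver_errors` and the patterns are illustrative assumptions that match the message styles quoted in this thread plus the usual `NVRM: Xid` prefix; they are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Illustrative patterns only (an assumption, not an exhaustive list):
# they cover the i915 and nvidia-modeset messages quoted in this thread,
# plus the "NVRM: Xid" prefix the NVIDIA kernel module uses for GPU errors.
PATTERNS = {
    "i915": re.compile(r"\[drm:\w+ \[i915\]\] \*ERROR\*"),
    "nvidia": re.compile(r"nvidia-modeset: ERROR|NVRM: Xid"),
}

def count_driver_errors(lines):
    """Count error lines per driver in an iterable of kernel log lines."""
    counts = Counter()
    for line in lines:
        for driver, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[driver] += 1
    return counts

# Sample input: the three error lines quoted earlier in this thread.
sample = [
    "[ 8257.412209] [drm:lspcon_init [i915]] *ERROR* Failed to probe lspcon",
    "[ 8257.412285] [drm:lspcon_resume [i915]] *ERROR* LSPCON init failed on port D",
    "nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: "
    "0x0000c57d:0 2:0:4048:4040",
]
print(dict(count_driver_errors(sample)))
```

Running the same function over a log captured right after a freeze would show whether the NVIDIA messages appear at all in that boot, or only the i915 ones.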

Then create a new log when it happens the next time.

Hi @generix ,

created a new one, please find it attached. Some errors:

or

*** /var/lib/dkms/nvidia/525.147.05/5.15.0-89-generic/x86_64/log/make.log
*** ls: -rw-r--r-- 1 root root 1147548 2023-11-24 23:52:35.344875736 +0100 /var/lib/dkms/nvidia/525.147.05/5.15.0-89-generic/x86_64/log/make.log
DKMS make.log for nvidia-525.147.05 for kernel 5.15.0-89-generic (x86_64)
Fr 24. Nov 23:52:10 CET 2023
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-89-generic'
test -e include/generated/autoconf.h -a -e include/config/auto.conf || (		\
echo >&2;							\
echo >&2 "  ERROR: Kernel configuration is invalid.";		\
echo >&2 "         include/generated/autoconf.h or include/config/auto.conf are missing.";\
echo >&2 "         Run 'make oldconfig && make prepare' on kernel src to fix it.";	\
echo >&2 ;							\
/bin/false)
make -f ./scripts/Makefile.build obj=/var/lib/dkms/nvidia/525.147.05/build \

But these two errors were there even in the initial logs.

Is Ubuntu 22.04 running on kernel 5.15.0-89-generic actually compatible with NVIDIA driver 525.147.05 and CUDA 12.0?

nvidia-bug-report.log.gz (718.9 KB)

Again, no NVIDIA-related errors in the log.
The error you see in the compile log is always displayed; ignore it.
This rather looks like an intel i915 bug:
https://gitlab.freedesktop.org/drm/intel/-/issues/4458

Yes.

hm. I see. Thanks a lot for your efforts in the support :).

This issue is only present when nvidia.ko is loaded. So it is not obvious whether Intel is the culprit. Intel is just displaying more debugging info. Might very well be just detecting some mis-behaviour from nvidia.ko

Maybe, but if the nvidia driver can take down the intel driver while the nvidia gpu is in offload mode and sleeping, this points to some issue with the intel driver, the nvidia driver rather being the trigger, not the reason. Since the OP had a specific kernel version when this started, the usual way to find the issue would be doing a kernel bisect.
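For anyone unfamiliar with bisecting: the idea is a binary search between a known-good and a known-bad build, testing the midpoint each time. The sketch below only illustrates that search logic. The build labels and the `is_bad` check are hypothetical placeholders; in practice each "test" means booting that kernel and watching for the freeze, and for kernel source `git bisect` does the bookkeeping.

```python
def first_bad(builds, is_bad):
    """Return the first build for which is_bad() is True.

    Assumes `builds` is ordered oldest to newest, the first entry is known
    good, the last entry is known bad, and the problem stays present in
    every build after the one that introduced it.
    """
    lo, hi = 0, len(builds) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid                 # regression is at mid or earlier
        else:
            lo = mid                 # regression is after mid
    return builds[hi]

# Hypothetical build labels and a made-up test result, for illustration only.
builds = [f"build-{n}" for n in range(10)]
culprit = first_bad(builds, lambda b: int(b.split("-")[1]) >= 6)
print(culprit)
```

With ten candidate builds this needs three tests instead of up to eight, which matters when every test is a reboot plus a wait for a freeze.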

Still haven’t been able to do anything about this one. I’ve just noticed that freezes are more likely if I’m using the NVIDIA GPU…

Checked for a bios update meanwhile?

@generix thanks for the quick response!

Yes, of course. I actually updated it sometime after this problem started occurring.

I wonder if I should just keep trying different versions of the kernel & nvidia drivers until I find something stable.

Although your point actually was that this likely is an issue with the intel GPU.

Can confirm same problem both with kernel 6.2 & 6.5