I have posted this on Reddit as well, but I'll post it here too in the hope that someone can help out with the issue.
Hey all,
I'm having an issue with my laptop: after installing the NVIDIA driver, the internal screen glitches out.
Sadly, I have tried two distros and they both have the same issue:
- Arch
- openSUSE
The installed NVIDIA driver was the repository default, nvidia-545. With it I have also tried setting the kernel parameters nvidia-drm.fbdev and nvidia-drm.modeset to both 1 and 0.
Currently I am running Ubuntu 22.04 with the 535 driver without any issues, but when the laptop goes to sleep and comes back, the screen glitches as seen in the video.
My hardware:
- HP Omen 16.1" (16-wf0xxx with BIOS F.13)
- Intel Core i7-13700HX
- 32 GB DDR5
- RTX 4080
- 16.1" IPS 1440p @ 240 Hz

(arandr shows the internal display as DP-0, so I assume it is internally connected via DisplayPort.)
To avoid dealing with hybrid graphics, I have set the display mode to “dedicated” in the UEFI instead of “Advanced Optimus”.
The graphical artifacts only appear on the laptop screen, not on my second display. They do not appear in the UEFI, nor in a CLI-only distro (the Arch install image, for example), and the issue does not happen in Windows 11 either.
When the glitching happens, I also do not notice any performance difference.
Is this just NVIDIA being NVIDIA, or is there something I can try?
Thanks for reading!
Note: the attached log is from the Ubuntu installation that has the issue when the system comes back from sleep.
This might be an issue with the NVIDIA driver and the high-refresh-rate screen when switching power levels. To test (see the sketch after this list for a way to watch the power level while it happens):
- Does it also happen when the screen is switched to 60 Hz?
- Does switching to “Prefer Maximum Performance” in nvidia-settings prevent it when running at 240 Hz?
- How many DisplayPort lanes, and at what speed, are reported in nvidia-settings?
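As a rough way to correlate the glitches with power-level changes, here is a minimal pynvml sketch, assuming the nvidia-ml-py package is installed (untested on this particular machine), that polls the performance state and graphics clock once per second:

```python
# Poll the GPU performance state and graphics clock once per second,
# to see whether the glitches coincide with power-level transitions.
# Assumes the nvidia-ml-py package (pip install nvidia-ml-py).
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (and only) GPU
try:
    while True:
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
        print(f"P{pstate}  graphics clock: {clock} MHz")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

If the artifacts line up with P-state transitions, that would support the power-level theory.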
Checking the logs again, I found that after sleep the NVIDIA GPU comes back with weird values set:
```
Temperature
    GPU Current Temp               : 48 C
    GPU T.Limit Temp               : 39 C
    GPU Shutdown T.Limit Temp      : -5 C
    GPU Slowdown T.Limit Temp      : -2 C
    GPU Max Operating T.Limit Temp : 0 C
```

resulting in:

```
Clocks Event Reasons
    SW Power Cap        : Active
    SW Thermal Slowdown : Active
```
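For what it's worth, the same throttle state can be confirmed outside of nvidia-smi. A minimal sketch using pynvml (assuming the nvidia-ml-py package; untested against this exact driver/BIOS combination):

```python
# Decode the active clock-throttle reasons directly from NVML, to confirm
# the "SW Power Cap" / "SW Thermal Slowdown" state that nvidia-smi -q shows.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask
flags = {
    "SW Power Cap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "SW Thermal Slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW Slowdown":         pynvml.nvmlClocksThrottleReasonHwSlowdown,
}
for name, bit in flags.items():
    print(f"{name}: {'Active' if reasons & bit else 'Not Active'}")
pynvml.nvmlShutdown()
```

If NVML reports the same bogus throttling on a fresh boot, the problem sits below the command-line tools.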
I checked, and I am already running the latest BIOS version, from 24 November 2023. But aren't those values weird? I mean, this log was made on a fresh boot, not after a sleep state.
The issue also doesn't arise when I am booted into Windows, whether the machine has slept or not. When my internet is back (outage…) I'll reinstall Arch and make a new log dump with the newer driver, just to gather more information.
In that state, the NVIDIA GPU can't reasonably be used. AFAIK, the values are read from the VBIOS, so something is really broken there. I don't know why the Windows driver works with that.
Maybe the issue is with how the driver reads those values? I am getting these results in GPU-Z under Windows, and that's quite the difference. If the Linux driver reads a GPU thermal limit of 39 °C and the Windows one reads 87 °C, it sounds like it's driver-related and not VBIOS-related?
Rather the intermediate layer: how the driver reads the VBIOS, likely through ACPI provided by the system BIOS. The Windows and Linux drivers also prefer different methods of loading the VBIOS. Furthermore, Windows is known to handle ACPI differently and to work around broken implementations. “Works with Windows, ship it!”
I'll just make a new dump with the 545 driver under Arch when I get my internet connection back (an underground fiber line broke…). Hopefully that's soon; then I'll update with a new log file.
Okay! So I was able to get an openSUSE Tumbleweed installation going again with NVIDIA driver version 545.29.06, and now I somehow do NOT have these glitches anymore. What I did notice, however, was this error:
```
localhost.localdomain kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
```
Booting is a 50/50 chance: either the X server starts, or the system just keeps spewing the message above. Even when the system boots successfully the messages are still there, just fewer of them. I generated a new bug report, which I attached to this message.
The fun part is that the “GPU T.Limit Temp : 36 C” differs from what the nvidia-settings program reports, which is a thermal limit of 97 °C. Or shouldn't I read “GPU T.Limit” as a GPU thermal limit?
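One way to narrow down which number means what would be to dump every temperature threshold NVML exposes and line them up against the nvidia-smi and nvidia-settings readings. A sketch, again assuming nvidia-ml-py (some thresholds may simply return Not Supported on mobile GPUs):

```python
# Dump all temperature thresholds NVML knows about, so the values from
# nvidia-smi -q and nvidia-settings can be matched against NVML's view.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
kinds = {
    "shutdown":      pynvml.NVML_TEMPERATURE_THRESHOLD_SHUTDOWN,
    "slowdown":      pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN,
    "gpu max":       pynvml.NVML_TEMPERATURE_THRESHOLD_GPU_MAX,
    "acoustic min":  pynvml.NVML_TEMPERATURE_THRESHOLD_ACOUSTIC_MIN,
    "acoustic curr": pynvml.NVML_TEMPERATURE_THRESHOLD_ACOUSTIC_CURR,
    "acoustic max":  pynvml.NVML_TEMPERATURE_THRESHOLD_ACOUSTIC_MAX,
}
for name, kind in kinds.items():
    try:
        print(f"{name:13s}: {pynvml.nvmlDeviceGetTemperatureThreshold(handle, kind)} C")
    except pynvml.NVMLError as err:  # e.g. Not Supported on this GPU
        print(f"{name:13s}: {err}")
pynvml.nvmlShutdown()
```

Comparing those numbers before and after a suspend/resume cycle should show whether the threshold itself changes or just how each tool labels it.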