Series 550 freezes laptop

They haven’t solved it.

I will stick to 535.171.04 and kernel 6.8.7, it works great.

They haven’t solved it. Nvidia 550.xx drivers cause a kernel panic on my laptop with an RTX 3050.

Hi All,
The freeze I observed locally on my setup has a different stack than what has been reported here.
Since then I have tried to reproduce it on a couple of systems over a few days, updating packages followed by multiple reboots/logouts, but I have not observed the freeze.

I would appreciate it if anyone could share reliable repro steps for the freeze.
Also, to help isolate the issue, could someone please try the steps below and confirm whether they resolve it?

  1. Add “zswap.enabled=0” and “numa=off” to the kernel parameters
  2. Change the nvidia package to nvidia-open-dkms
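
To confirm after a reboot that the two parameters from step 1 are really active, a minimal Python sketch like this might help. It assumes a Linux system where /proc/cmdline is readable; the helper name missing_params is hypothetical and not part of any NVIDIA tooling.

```python
# Parameters suggested in step 1 for isolating the freeze.
WANTED = ("zswap.enabled=0", "numa=off")


def missing_params(cmdline: str, wanted=WANTED) -> list[str]:
    """Return the wanted parameters that are absent from a kernel command line."""
    # The kernel command line is a whitespace-separated list of parameters.
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]
```

On the affected machine you could call it as missing_params(open("/proc/cmdline").read()); an empty list means both parameters were passed to the running kernel.
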

The nvidia-open-dkms drivers solved the random freezes for me, and the system now runs stable with them. But I cannot use suspend with them: X11 doesn’t come up again. When I try to reboot, systemd says that nvidia-suspend is blocking it.
I am using NixOS, not Arch (because it seems like everyone else in here uses Arch or an Arch-based distro).

Hello @amrits, regarding the freezing: I can’t tell whether it is related to the kernel panics that have also been described in this discussion (my own example is only a photo, though). In my case, powering off would at times trigger a panic that left the system completely frozen, so I couldn’t even use REISUB.

Ever since adding

“amdgpu.vm_update_mode=3 amdgpu.dcdebugmask=0x4”

kernel parameters I have not had any kernel panics (so far), but I wanted to ask: has anyone else observed this too?

As @bakabo8 pointed out, I tried the 550.78 (open) drivers.
Those fixed the issue(s) for me! (4050, gentoo, hwprobe, emerge --info)

What I tested:

  • Boot with “default” commandline/kernelsettings
  • nvidia-smi works now. It didn’t work with the closed 550.76 drivers, as the module(s) failed to work correctly (I didn’t test without uvm, as @demensdeum suggested, since this has drawbacks). Turning on PCI debugging showed some errors that seem related to vulkaninfo/eglinfo blocking (on 550.76 closed); I’d guess this is why nvidia-smi and other applications got stuck. dmesg 550.76-closed at 18:03:45
  • prime-run valheim at max settings, no DLSS, 1080p (~55 avg fps, no huge drops in the starting area): I didn’t see any issues in a quick test
  • nvidia/PCI power management finally seems to work again, going to D3COLD when the dGPU is not in use. It didn’t work reliably at all for me on 550.xx before, so I had downgraded to “stable” 535.171.04 (closed)
  • suspend also works in combination with the acpi_osi=! "acpi_osi=Windows 2015" boot parameters ("Windows 2015" is correct for me; you need to look up yours). Sadly I can’t remember exactly, but on either 550.76-open/closed or 535.171-open the first suspend succeeded (with the dGPU locked in D0 after resume) and the second or third suspend resulted in a kernel panic (in case someone wants to reproduce it).

However:

  • lots of warnings due to nvidia-open 0x0[dot]st/XHek.txt
  • glxgears has strange output ibb[dot]co/2tKYRgy
  • will do some more testing/benchmarking, but I will probably downgrade to 535.xx-closed again for now.

edit: Not fixed for nvidia-550.78-closed.

What works on closed:

  • Boot->startx->nvidia-smi
  • I’m not sure about initial pci power state.

What didn’t work on closed:

  • starting valheim as before (gets stuck), then starting nvidia-smi again (now stuck, not responding, ctrl+c not working). dmesg 550.78-closed (similar to the 550.76 dmesg above). I guess you can reproduce it with other DirectX/Vulkan/3D applications or games that use the dGPU via prime-run
  • /sys/bus/pci/devices/0000:01:00.0/power/runtime_status remains active
  • edit: shutdown. The system was stuck on ending user processes. SysRq r+e+i didn’t solve it either. I had to use SysRq s+u+b to reboot forcefully. edit2: SysRq + o didn’t shut down the system; I had to press the power key for 10 sec.
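
For anyone who wants to compare the runtime power state between driver versions, here is a rough Python sketch that reads the sysfs files mentioned above. The PCI address 0000:01:00.0 is the one from my system and will likely differ on yours (check lspci); the function name runtime_pm_state is made up for this example.

```python
from pathlib import Path

# PCI address of the dGPU as quoted above; an assumption for other machines.
DGPU = "0000:01:00.0"


def runtime_pm_state(device: str = DGPU,
                     sysfs_root: str = "/sys/bus/pci/devices") -> dict[str, str]:
    """Read the runtime power-management files the kernel exposes for a PCI device."""
    power = Path(sysfs_root) / device / "power"
    state = {}
    for name in ("runtime_status", "control"):
        try:
            state[name] = (power / name).read_text().strip()
        except OSError:
            # File missing or unreadable, e.g. wrong address or no such device.
            state[name] = "unavailable"
    return state
```

When runtime power management works, runtime_status should read suspended while the dGPU is idle; a value of active at idle matches the behaviour described above.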

edit: numa=off didn’t seem to change anything (compared to the 550.78-closed run above). bpa[dot]st/MGCA

ps.: nvidia-smi does not show “X” as application anymore, not sure if this is 550.xx, or xorg related.

In my case, on my Lenovo Legion 5, I needed nvidia-open-dkms (as I use a custom kernel) and “amdgpu.vm_update_mode=3 amdgpu.dcdebugmask=0x4” in order to resolve the problems entirely.

I’m not sure whether we are discussing several different but similar 550.xx-related issues on 3xxx/4xxx laptops here, as I’m not experiencing the same random freeze as in the initial post. Gentoo IRC/devs linked me to this thread and said it’s related.

To complete my previous report: I added all of the above parameters (amdgpu.vm_update_mode=3 amdgpu.dcdebugmask=0x4 zswap.enabled=0 numa=off), but they didn’t change my issues on 550.78-closed or 550.78-open.

Makes sense; maybe there are also issues coming from Mesa somehow on dual-GPU systems?
But anyhow, we need to get the facts straight to at least help others in the future too xp.

edit: I forgot to add that I am also just using 550.76 dkms now. EndeavourOS has a package that handles this for you, so it still includes the dkms driver and I’m just sticking with it xp.

I don’t think amdgpu.vm_update_mode=3 amdgpu.dcdebugmask=0x4 is related to series 550. That is an amdgpu bug that has been around for more than a year, since before the nvidia 550 series debuted. But I will test whether the option actually matters.

EDIT1: The nvidia proprietary driver 550.78 crashed with zswap and numa enabled while I was running some CUDA tasks.

EDIT2: It also crashed with zswap disabled. I deleted the nvidia-open section because I only tested it for a few minutes.

You need to try installing packages that request a reload of the system manager configuration; for example, you can test with supergfxctl.

@mario156090
I compiled the supergfxctl package and used the utility to switch between Hybrid and Integrated mode multiple times, but I still did not observe the freeze.
How frequently were you able to reproduce the issue?

It always happened while installing or upgrading packages. This is happening on Arch Linux. Did you test with Arch too, or are you using another OS?

You can also try running tasks that are both CPU- and GPU-intensive while upgrading packages, suspending and resuming, or just plugging and unplugging the power supply multiple times. Also, it usually takes more than 30 minutes of observation before the issue shows up.

In my case it was not caused by a specific package. It happened randomly during package updates/installations, at the “reloading system manager configuration” point. But it didn’t always happen.

I also had some random freezes on the linux boot screen. I don’t remember if it was at any specific point.

It’s something random and I didn’t see a way to force them. I’m sorry I can’t provide more information but I can’t risk losing the system again. With version 545 it has never happened to me.

For me it happened during shutdown just now, after switching from VFIO mode to Hybrid mode. VFIO can be a bit weird to deal with, not going to lie, so this case is not really representative.
However, I now have most of my issues during shutdown. I can’t tell whether all of them are related to nvidia, but yesterday and today I had more problems: yesterday a black screen, and today a freeze or even a kernel panic at shutdown, as REISUB would not work again…

Here is the updated bug report (v550.76):
nvidia-bug-report.log.gz (4.0 MB)

In my case the steps to reproduce are somewhat finicky. Nevertheless, it happens every time within 48 hours. I’m on a Razer Blade 15 with an Intel 10875H and a 3080 (supports hybrid graphics).

Steps to reproduce:

  • vanilla / simple / default installation of arch linux
  • install the drivers with sudo pacman -Suuyy nvidia nvidia-utils nvidia-settings nvidia-prime cuda cuda-tools cudnn (not dkms)
  • DRM modesetting enabled in grub: GRUB_CMDLINE_LINUX_DEFAULT="cryptdevice=/dev/md/root:cryptroot:allow-discards nvidia_drm.modeset=1 loglevel=3"
  • default installation of KDE plasma with Wayland session (now default for KDE)
  • use external monitor via thunderbolt (I don’t think this step matters)
  • in hybrid graphics mode (I don’t think this step matters)
  • within 48 hours, running sudo pacman -Suuyy will result in a “hard freeze” during the “Reloading system manager configuration…” phase. The system is completely unresponsive: I can’t change tty and can’t ssh into the system from another machine. The only option is to shut down via a long press of the power button. Upon reboot, pacman packages are corrupted, and sometimes the system does not even boot. It does not happen on every update, but it happens reliably within a day or two at most. It happened to me three times within 6 days. No crashes / freezes since I switched to Nouveau. (It means no CUDA though 😭)
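
If you want to compare your own GRUB configuration against the command line from the steps above, a small Python sketch along these lines could do it. The function name grub_cmdline_params is made up for illustration, and it only handles a single GRUB_CMDLINE_LINUX_DEFAULT line as it would appear in /etc/default/grub.

```python
import shlex


def grub_cmdline_params(line: str) -> list[str]:
    """Split a GRUB_CMDLINE_LINUX_DEFAULT=... line into individual kernel parameters."""
    key, _, value = line.strip().partition("=")
    if key != "GRUB_CMDLINE_LINUX_DEFAULT":
        raise ValueError("not a GRUB_CMDLINE_LINUX_DEFAULT line")
    # shlex removes the shell quoting around the value; the result is then
    # split on whitespace into the separate kernel parameters.
    tokens = shlex.split(value)
    return tokens[0].split() if tokens else []


line = ('GRUB_CMDLINE_LINUX_DEFAULT="cryptdevice=/dev/md/root:cryptroot:allow-discards '
        'nvidia_drm.modeset=1 loglevel=3"')
params = grub_cmdline_params(line)
print("nvidia_drm.modeset=1" in params)
```

The check only tells you what GRUB is configured to pass; the parameters take effect after regenerating the GRUB config and rebooting.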

I know the Arch folks are looking into something that is possibly related: [SOLVED] Kernel Panic during Update D: / Newbie Corner / Arch Linux Forums

Let me know if there is anything I can do to help debug.

Happy hacking !

What utility did you use to switch between hybrid and integrated? I’m asking because the majority of these utils only support Xorg. I suspect it is easier to reproduce with a Wayland session.

It’s called supergfxctl, and it’s from the asus-ctl team: supergfxctl official.
It’s now also possible to just use it with switcheroo-control.
It supports Hybrid, Integrated and VFIO, as well as a couple of other specific modes.

There is also an unofficial package on the AUR.
