Series 550 freezes laptop

Me too. After every test with 550.67 I go back to 545.29, which is stable for me.

I can confirm that I also experience random hard freezes/lockups with the 550 driver; at one point I thought my laptop itself had issues.
The laptop is a Lenovo 5 15ARH7H with an RTX 3060.

I had problems with 550 on my Lenovo until I turned on hybrid mode and switched to NVIDIA PRIME in nvidia-settings instead.

As a follow-up to my previous reply: I reinstalled my distro and have been using it for the past 2-3 weeks or so.
I had 2 freezes, or rather black-screen issues, but I am going to write these off as a more random, ‘could be something else’ problem.

But today my PC had a complete kernel panic for some reason. The laptop could not even be force-rebooted; I could only force a shutdown with the power button.

Additional info:
-Using the nvidia-inst script;
-Using Supergfxctl;
-RTD3 works, but sometimes I also leave the GPU on for an external display;
-The nvidia-suspend and related services, except powerd, are all disabled.

Unfortunately I don’t know how to trace the kernel panic issue, but I will leave a photo and the bug report showing what it is about. It also seems similar to the ones some others had here.

nvidia-bug-report.log.gz (1.5 MB)

I was able to pinpoint the reason for my crashes, and it turns out it was a bug in the AMD driver for the iGPU (which, even when turned off, does something in the background).
Since my Lenovo Legion 5 is a hybrid setup (iGPU + dGPU), it always goes through the internal AMD 680M, which has some freezing issues documented well here:
https://wiki.archlinux.org/title/AMDGPU#Freezes_with_“[drm]_IP_block:gmc_v8_0_is_hung!”_kernel_error

https://bbs.archlinux.org/viewtopic.php?id=288083

Now I am not sure if this would help out any of you guys, but setting both of these:
amdgpu.vm_update_mode=3 amdgpu.dcsebugmask=0x4
on the kernel command line (whether via systemd-boot or GRUB) has worked like a charm for 2 days without a single freeze.
I haven’t noticed any kernel panics either; that is with 550.67.

The problem seems to happen on Arch-based systems. I tested on Fedora Bazzite with 550.67 and kernel 6.8: no issues. The same Legion 7 Slim AMD laptop freezes with CachyOS (also using 6.8). It would be interesting to test on openSUSE Tumbleweed or Solus.
I have an AMD 7840HS CPU with an AMD 780M iGPU and an NVIDIA RTX 4060 dGPU, in hybrid mode. Both systemd-boot and GRUB installs are affected. If I uninstall the NVIDIA drivers, the system is very stable, no issues at all. I also tried using the new NVK (NVIDIA Vulkan driver, open source), as Cachy has a recent Mesa 24 version.

you mean dcdebugmask right?

Yes, pardon the typo.

I’m experiencing kernel oopses/panics when trying to shut down: the system stays at the desktop environment, doesn’t shut down fully, or freezes. This is with a 2070 Max-Q and version 550.67. I haven’t been able to generate a bug report file, but here’s a dmesg log:

[20095.569011] BUG: unable to handle page fault for address: 0000000000002029
[20095.569017] #PF: supervisor read access in kernel mode
[20095.569019] #PF: error_code(0x0000) - not-present page
[20095.569021] PGD 0 P4D 0 
[20095.569024] Oops: 0000 [#1] PREEMPT SMP NOPTI
[20095.569026] CPU: 11 PID: 1758 Comm: systemd Tainted: P           OE      6.8.1-1-default #1 openSUSE Tumbleweed a408dede100ecd8172a7eae2d0778227ac69e46d
[20095.569030] Hardware name: Micro-Star International Co., Ltd. GL75 Leopard 10SFK/MS-17E7, BIOS E17E7IMS.10B 10/23/2020
[20095.569032] RIP: 0010:rb_first+0xf/0x30
[20095.569040] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
[20095.569042] RSP: 0018:ffffba618349bc30 EFLAGS: 00010202
[20095.569045] RAX: 0000000000002019 RBX: ffff9cb650402600 RCX: 00000000010000fe
[20095.569047] RDX: 0000000000002019 RSI: 0000000000000000 RDI: ffff9cb6049e42b8
[20095.569048] RBP: ffff9cb650402880 R08: ffff9cb606823ad0 R09: 00000000010000fe
[20095.569050] R10: 00000000010000fe R11: ffffba618349bc30 R12: 0000000000000000
[20095.569051] R13: ffff9cb6049e42b8 R14: ffff9cb623a9c000 R15: ffffffffb9dedd58
[20095.569053] FS:  00007f08bebcb900(0000) GS:ffff9cb94e580000(0000) knlGS:0000000000000000
[20095.569055] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20095.569056] CR2: 0000000000002029 CR3: 000000010d508004 CR4: 00000000007706f0
[20095.569058] PKRU: 55555554
[20095.569059] Call Trace:
[20095.569063]  <TASK>
[20095.569066]  ? __die+0x23/0x70
[20095.569072]  ? page_fault_oops+0x14d/0x490
[20095.569076]  ? __traverse_mounts+0x134/0x210
[20095.569080]  ? copy_from_kernel_nofault+0x21/0xe0
[20095.569083]  ? exc_page_fault+0x71/0x160
[20095.569087]  ? asm_exc_page_fault+0x26/0x30
[20095.569094]  ? rb_first+0xf/0x30
[20095.569097]  simple_xattrs_free+0x29/0x90
[20095.569100]  kernfs_put.part.0+0x60/0x150
[20095.569104]  kernfs_remove_by_name_ns+0x81/0xd0
[20095.569108]  cgroup_addrm_files+0x28b/0x300
[20095.569113]  ? __filename_parentat+0xe9/0x210
[20095.569116]  ? generic_permission+0x39/0x220
[20095.569118]  css_clear_dir+0x4b/0xc0
[20095.569122]  cgroup_destroy_locked+0xcc/0x1b0
[20095.569125]  cgroup_rmdir+0x2b/0xd0
[20095.569127]  kernfs_iop_rmdir+0x50/0x80
[20095.569129]  vfs_rmdir+0x97/0x200
[20095.569131]  do_rmdir+0x17d/0x190
[20095.569135]  __x64_sys_rmdir+0x42/0x70
[20095.569137]  do_syscall_64+0x86/0x170
[20095.569141]  ? syscall_exit_to_user_mode+0x80/0x230
[20095.569143]  ? do_syscall_64+0x96/0x170
[20095.569146]  ? syscall_exit_to_user_mode+0x80/0x230
[20095.569148]  ? do_syscall_64+0x96/0x170
[20095.569150]  ? do_syscall_64+0x96/0x170
[20095.569152]  ? do_syscall_64+0x96/0x170
[20095.569154]  ? __irq_exit_rcu+0x3b/0xb0
[20095.569159]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[20095.569162] RIP: 0033:0x7f08be505ebb
[20095.569198] Code: 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 8f 0e 00 f7 d8
[20095.569200] RSP: 002b:00007fff066749c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[20095.569202] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f08be505ebb
[20095.569204] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00005560c772e200
[20095.569205] RBP: 00005560c76a91b0 R08: 00007f08be8bd6be R09: 0000000000000007
[20095.569206] R10: 00005560c76bb780 R11: 0000000000000246 R12: 00007f08beac6f3e
[20095.569207] R13: 0000000000000001 R14: 0000000000000000 R15: 00005560c772e200
[20095.569210]  </TASK>
[20095.569211] Modules linked in: af_packet rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm algif_aead des3_ede_x86_64 des_generic libdes md4 nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables qrtr cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nls_cp437 vfat fat iwlmvm snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel mac80211 snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp libarc4 snd_sof snd_sof_utils soundwire_generic_allocation soundwire_bus snd_hda_codec_realtek snd_soc_skl snd_hda_codec_generic snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_compress snd_pcm_dmaengine btusb intel_uncore_frequency intel_uncore_frequency_common uvcvideo snd_hda_intel btrtl intel_tcc_cooling
[20095.569253]  snd_intel_dspcfg btintel videobuf2_vmalloc snd_intel_sdw_acpi iwlwifi btbcm uvc x86_pkg_temp_thermal snd_hda_codec intel_powerclamp ucsi_ccg videobuf2_memops btmtk r8169 spi_nor bluetooth typec_ucsi videobuf2_v4l2 snd_hda_core iTCO_wdt typec realtek intel_pmc_bxt snd_hwdep ext4 cfg80211 mtd kvm_intel roles mei_hdcp videodev ee1004 mei_pxp snd_pcm iTCO_vendor_support rtsx_usb_ms ecdh_generic mdio_devres mbcache kvm msi_wmi snd_timer spi_intel_pci i2c_i801 mei_me videobuf2_common snd i2c_nvidia_gpu irqbypass sparse_keymap wmi_bmof libphy pcspkr efi_pstore memstick spi_intel i2c_smbus mc i2c_ccgx_ucsi mei soundcore jbd2 rfkill thermal intel_pch_thermal intel_pmc_core gpio_keys ac intel_vsec pmt_telemetry pmt_class acpi_pad soc_button_array joydev tiny_power_button fuse nvme_fabrics configfs nfnetlink dmi_sysfs ip_tables x_tables rtsx_usb_sdmmc mmc_core usbhid rtsx_usb ahci libahci libata hid_multitouch hid_generic crct10dif_pclmul crc32_pclmul sd_mod polyval_clmulni polyval_generic scsi_dh_emc gf128mul
[20095.569308]  scsi_dh_rdac nvme scsi_dh_alua xhci_pci xhci_pci_renesas sg ghash_clmulni_intel nvme_core xhci_hcd scsi_mod sha512_ssse3 sha256_ssse3 intel_lpss_pci sha1_ssse3 nvme_auth intel_lpss i2c_hid_acpi aesni_intel usbcore crypto_simd cryptd scsi_common t10_pi idma64 i2c_hid battery pinctrl_cannonlake button serio_raw mxm_wmi nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) i915 i2c_algo_bit drm_buddy video wmi ttm drm_display_helper cec rc_core btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq coretemp br_netfilter bridge stp llc pkcs8_key_parser msr efivarfs
[20095.569342] CR2: 0000000000002029
[20095.569345] ---[ end trace 0000000000000000 ]---
[20095.569346] RIP: 0010:rb_first+0xf/0x30
[20095.569349] Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
[20095.569351] RSP: 0018:ffffba618349bc30 EFLAGS: 00010202
[20095.569353] RAX: 0000000000002019 RBX: ffff9cb650402600 RCX: 00000000010000fe
[20095.569354] RDX: 0000000000002019 RSI: 0000000000000000 RDI: ffff9cb6049e42b8
[20095.569355] RBP: ffff9cb650402880 R08: ffff9cb606823ad0 R09: 00000000010000fe
[20095.569357] R10: 00000000010000fe R11: ffffba618349bc30 R12: 0000000000000000
[20095.569358] R13: ffff9cb6049e42b8 R14: ffff9cb623a9c000 R15: ffffffffb9dedd58
[20095.569359] FS:  00007f08bebcb900(0000) GS:ffff9cb94e580000(0000) knlGS:0000000000000000
[20095.569361] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20095.569362] CR2: 0000000000002029 CR3: 000000010d508004 CR4: 00000000007706f0
[20095.569364] PKRU: 55555554
[20095.569365] note: systemd[1758] exited with irqs disabled
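
For anyone reading the log above: the letters after "Tainted:" and the "(POE)" suffixes in the module list show which loaded modules are proprietary (P), out-of-tree (O) and unsigned (E). Below is a minimal Python sketch for pulling those out of a pasted oops. The function names are my own, and the flag table is only a subset of the kernel's taint flags, so treat it as a reading aid rather than a diagnostic tool.

```python
import re

# Subset of kernel taint letters (the kernel documents more than these).
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "O": "out-of-tree module was loaded",
    "E": "unsigned module was loaded",
}


def taint_letters(oops_line: str) -> list[str]:
    """Return the taint letters that follow 'Tainted:' in a 'CPU: ...' oops line."""
    match = re.search(r"Tainted:\s*((?:[A-Z]\s*)+)", oops_line)
    if not match:
        return []
    return [char for char in match.group(1) if char.isalpha()]


def flagged_modules(modules_text: str) -> dict[str, str]:
    """Map module name -> flag letters for entries such as 'nvidia_drm(POE)'."""
    return dict(re.findall(r"(\w+)\(([A-Z]+)\)", modules_text))


if __name__ == "__main__":
    cpu_line = "CPU: 11 PID: 1758 Comm: systemd Tainted: P           OE      6.8.1-1-default #1"
    modules = "mxm_wmi nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) i915"
    for letter in taint_letters(cpu_line):
        print(letter, "=", TAINT_FLAGS.get(letter, "not in this subset"))
    print(sorted(flagged_modules(modules)))
```

In the log above this singles out the four nvidia modules. That only shows they are loaded and taint the kernel; it does not by itself prove they caused the fault.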

The same on Manjaro Linux, G14 GA401QM with 3060 Mobile, KDE 5.27.11-1/KDE 6.0.2-3, X11

I upgraded linux66-nvidia from 545.29.06-36 to 550.54.14-2 and got the first unhandled page fault after a reboot. Now it reproduces randomly every 2-20 hours of uptime. I get “unable to handle page fault for address…”. The addresses are very similar: 0000000000009b8b/000000000001ba19/00000000000100f3/000000000003520a/0000000000018415/0000000000021052/etc… But the stack traces differ.

There are 13 crash messages from my laptop: Crashes caused by nvidia 550 driver - Pastebin.com
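
To compare several crashes like these, it can help to reduce each oops to the faulting address and the kernel function in RIP. Here is a rough Python sketch, assuming the log text has the same "BUG: ..." and "RIP: 0010:symbol+0x.../0x..." lines as the oopses pasted in this thread. The function name is made up for illustration.

```python
import re

# Matches both fault messages seen in this thread.
FAULT_RE = re.compile(
    r"(?:unable to handle page fault for address|NULL pointer dereference, address): "
    r"([0-9a-f]+)"
)
# Kernel-mode RIP lines only; user-space ones ("RIP: 0033:0x...") carry no symbol.
RIP_RE = re.compile(r"RIP: 0010:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def crash_signatures(log_text: str) -> list[tuple[int, str]]:
    """Return one (fault address, RIP symbol) pair per oops found in the text."""
    signatures = []
    pending_address = None
    for line in log_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            pending_address = int(fault.group(1), 16)
            continue
        rip = RIP_RE.search(line)
        # Pair only the first kernel RIP after a fault line; the kernel prints
        # RIP again after the "end trace" marker, and that repeat is skipped.
        if rip and pending_address is not None:
            signatures.append((pending_address, rip.group(1)))
            pending_address = None
    return signatures


if __name__ == "__main__":
    sample = (
        "[20095.569011] BUG: unable to handle page fault for address: 0000000000002029\n"
        "[20095.569058] RIP: 0010:rb_first+0xf/0x30\n"
    )
    for address, symbol in crash_signatures(sample):
        print(hex(address), symbol)
```

If the symbol differs from crash to crash while the addresses stay small and unaligned, that would fit memory being corrupted somewhere else and only tripped over later, but the script alone cannot confirm that.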

If I switch to nouveau or disable my dGPU by using nomodeset or by switching to the iGPU, the system becomes stable, with no page faults for 2+ days. Upgrading to linux66-nvidia 550.67-1 (testing branch) doesn’t help. Switching to linux61 or linux68 doesn’t help either.

UPD: Also, two of these page faults have resulted in corruption of the rootfs btrfs volume ¯\_(ツ)_/¯

I have once again upgraded to 550.67 to give it another shot. It was mostly stable.

Every shutdown would cause kernel panics, and sometimes even booting would result in multiple failed boot attempts. Today, the laptop hung and the kernel panicked while upgrading the system with pacman, at the “Reloading system manager configuration” step, resulting in corrupt packages; now I generally doubt the integrity of the system.

However, I haven’t had any random kernel panics during general usage, or when plugging and unplugging the laptop.

I got annoyed, so I downgraded once again, back to 545. Before doing so, I took the last 2 kernel logs and generated an nvidia bug report.

nvidia-bug-report.log.gz (2.0 MB)
kernel.log (7.8 KB)
kernel2.log (5.3 KB)

@amrits hello, what about this problem?
Is there some fix?

Just to toss my 5 cents here. Shame on you Nvidia.

I am also having issues on upstream Arch Linux with the nvidia driver. I’ve experienced multiple hangs on the “Reloading system manager configuration” step of pacman -Syu. I’ve lost multiple hours figuring out how to fix the system after such failures: since this is one of the first steps that happens, I always had to force a reboot, effectively putting the system in an inconsistent state. Frequently that ended up in an unbootable situation, which I had to resolve by live booting, tracing back what was going to be installed, and forcing a reinstall.

I must admit I started to fear doing system upgrades (almost like Pavlov’s dogs), since there was a pretty high chance that things would go sideways. At the same time, postponing the upgrade meant there were more packages to fix if something went wrong…

I’ve recently tossed out the Nvidia driver (I didn’t bother downgrading, since your drivers supposedly need a matching kernel version) and, go figure, the hangs during pacman system upgrades went away.

Maybe Linus wasn’t wrong after all?

The ETA for the 555 beta drivers is 15.5; maybe they’re going to fix this issue with it.

I have the same issue on my new laptop with AMD+NVIDIA hybrid graphics, Linux and the NVIDIA 550 driver.
OS: Arch Linux x86_64
Kernel: 6.8.2
CPU: AMD Ryzen 9 7940HS w/ Radeon 780M
GPU: NVIDIA GeForce RTX 4060 Max-Q / Mobile
GPU: AMD ATI 64:00.0 Phoenix1
Over the last month my system has crashed 3 times with this error:

Apr 03 11:38:47 cosx kernel: BUG: kernel NULL pointer dereference, address: 00000000000000a5
Apr 03 11:38:47 cosx kernel: #PF: supervisor read access in kernel mode
Apr 03 11:38:47 cosx kernel: #PF: error_code(0x0000) - not-present page
Apr 03 11:38:47 cosx kernel: PGD 0 P4D 0
Apr 03 11:38:47 cosx kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 03 11:38:47 cosx kernel: CPU: 13 PID: 1329 Comm: systemd Tainted: P OE 6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
Apr 03 11:38:47 cosx kernel: Hardware name: ASUSTeK COMPUTER INC. Vivobook_ASUSLaptop M6500XV_M6500XV/M6500XV, BIOS M6500XV.309 12/04/2023
Apr 03 11:38:47 cosx kernel: RIP: 0010:rb_first+0xf/0x30

Once on system boot, once on system poweroff and once during a kernel and nvidia driver upgrade.
After the system upgrade my package manager was broken, so I had to reinstall the system.

In my case I use KDE with a Wayland session and an external monitor on the NVIDIA HDMI port.

Can you please try the fix I suggested earlier? It seems very similar to what I experienced on my setup.

For reference mine is a Lenovo Legion 5:
CPU: Ryzen 7 6800H / Radeon 680M
GPU: NVIDIA GeForce RTX 3060 Max-Q / Mobile
RAM: 16GB DDR5

The fix is to add these to your systemd-boot or GRUB configuration (whatever you use) as kernel parameters:
amdgpu.vm_update_mode=3 amdgpu.dcdebugmask=0x4

Please report back whether it worked or not; that resolved the problem for me and I haven’t had a single freeze since.
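
After editing the bootloader entry and rebooting, it is worth checking that both parameters actually reached the running kernel, since a typo in the parameter name is easy to miss. Here is a small Python sketch that parses a kernel command line string and reports which of the two are missing. The helper names are my own, and the simple whitespace split is an assumption that ignores quoted values containing spaces.

```python
from pathlib import Path

# The two parameters suggested in this thread.
WANTED = {"amdgpu.vm_update_mode": "3", "amdgpu.dcdebugmask": "0x4"}


def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into key -> value (empty value for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_params(cmdline: str, wanted: dict[str, str] = WANTED) -> list[str]:
    """Return the wanted 'key=value' entries that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{key}={value}" for key, value in wanted.items() if params.get(key) != value]


def check_running_kernel() -> list[str]:
    """Check the command line of the currently booted kernel (Linux only)."""
    return missing_params(Path("/proc/cmdline").read_text())
```

Calling `check_running_kernel()` on the affected machine should return an empty list once both parameters are in effect.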

Some Kdump investigation shows that the NVIDIA kernel module seems to write an incorrect memory address value to kernfs_node->iattr->xattrs->rb_root->rb_node, and eventually a program reads it, causing a kernel panic. https://fars.ee/qsH9

I have uploaded the relevant Kdump file: 703.43 MB folder on MEGA
Hope this helps with the investigation.
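
The dmesg oops posted earlier in the thread seems consistent with this. There, RIP is in rb_first, RAX holds 0x2019 and the faulting address (CR2) is 0x2029, and the marked instruction bytes (48 8b 40 10) decode to a load from RAX + 0x10. Assuming the usual 64-bit layout of struct rb_node (parent/color word, then rb_right, then rb_left), 0x10 is the offset of rb_left, so rb_first would be following a "node pointer" of 0x2019 that is not a kernel address at all. A short Python sketch of that arithmetic, where the offset and the address check are my assumptions rather than something taken from the dump:

```python
# Offset of rb_left in struct rb_node on a 64-bit kernel, ASSUMING the usual
# layout: __rb_parent_color (8 bytes), rb_right (8 bytes), then rb_left.
RB_LEFT_OFFSET = 0x10


def faulting_address(rb_node_ptr: int) -> int:
    """Address rb_first() reads when it follows node->rb_left."""
    return rb_node_ptr + RB_LEFT_OFFSET


def looks_like_kernel_pointer(value: int) -> bool:
    """Rough x86_64 check: kernel virtual addresses sit in the upper half."""
    return value >= 0xFFFF_8000_0000_0000


if __name__ == "__main__":
    rax, cr2 = 0x2019, 0x2029  # values from the dmesg oops in this thread
    print(hex(faulting_address(rax)), faulting_address(rax) == cr2)  # 0x2029 True
    print(looks_like_kernel_pointer(rax))                 # False
    print(looks_like_kernel_pointer(0xFFFF9CB6049E42B8))  # True (RDI in the same oops)
```

This only shows that the crash site is reading a garbage pointer stored in the tree root. It says nothing about who wrote it; that part comes from the Kdump analysis above.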

I am experiencing this problem on a Thinkpad T15p, a laptop with an Intel CPU/iGPU and an NVIDIA dGPU (3050 Laptop). Setting AMD-related kernel parameters is not an applicable solution for everyone.

That is fully understandable, but the user tm4ig seems to have a similar problem to mine, again with an AMD iGPU, which is why I suggested that to him.
