GPF when closing Chrome with slub_debug=P enabled on 465.19.01+ and 470.42.01

Hello,

On a fresh installation of Arch Linux (tested with kernel versions 5.12.13 and 5.10.46), I’m getting a general protection fault upon closing any Chrome-based rendering application. This includes the browsers Chromium and Opera, and the text editor Atom. It happens under both KDE and GNOME on Xorg.

The specific slub_debug poison option that reliably triggers the issue every time on this system is slub_debug=P,kmalloc-1k.
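To double-check which slub_debug flags and caches a given kernel command line actually requests, a small parser along these lines may help. It is only a sketch: the function names are made up for illustration, and it handles just the simple `slub_debug=<flags>,<cache>,...` form used above, not the bare `slub_debug` option or more elaborate multi-block syntax.

```python
def parse_slub_debug(cmdline):
    """Return (flags, caches) for a 'slub_debug=...' token, or None.

    Only the simple 'slub_debug=<flags>,<cache1>,<cache2>' form is handled;
    None means no such token was found (a bare 'slub_debug' is not parsed).
    """
    for token in cmdline.split():
        if token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            # First comma-separated field is the flag string, the rest are cache names.
            flags, *caches = value.split(",")
            return flags, caches
    return None


def poisoning_requested(cmdline):
    """True if the parsed flag string contains 'P' (object poisoning)."""
    parsed = parse_slub_debug(cmdline)
    return parsed is not None and "P" in parsed[0]


# Example with the option from this report (other parameters are made up).
example = "root=/dev/sda2 rw slub_debug=P,kmalloc-1k quiet"
print(parse_slub_debug(example))
print(poisoning_requested(example))
```

On a running system the same functions could be fed the contents of /proc/cmdline instead of the example string.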

When the crash happens, the system usually hangs immediately afterwards. In the rare cases where I’ve been able to ssh into the system remotely, the bug report tool simply hangs and produces nothing, even with --safe-mode enabled. This means that I’ve unfortunately not been able to run nvidia-bug-report.sh properly, nor attach any of the logs it should’ve produced.
The closest I’ve come to producing any sort of log is grabbing the output of dmesg remotely. I feel like there’s something wrong with the format of the log, but I have reproduced the output in-line at the end of this post.

The issue is present in any recent driver branch other than the 460 series, including all versions of 465 (versions tested 465.19.01 to 465.31), and the latest 470 branch (latest tested driver as of writing, 470.42.01).

System information:

Rampage III Extreme, BIOS 1502
Intel(R) Core(TM) i7 CPU 950  @ 3.07GHz
GeForce GTX 660

# uname -r
5.12.13-arch1-2

# lspci -nn -d 10de:
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK106 [GeForce GTX 660] [10de:11c0] (rev a1)

dmesg output:

[    9.049402] nvidia: loading out-of-tree module taints kernel.
[    9.049470] nvidia: module license 'NVIDIA' taints kernel.
[    9.049523] Disabling lock debugging due to kernel taint
[    9.059982] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    9.083400] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    9.084224] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   10.599152] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.19.01  Fri Mar 19 07:44:41 UTC 2021
[   10.641404] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  465.19.01  Fri Mar 19 07:51:16 UTC 2021
[   10.646844] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[   10.802803] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000dffff window]
[   10.802885] caller _nv000712rm+0x1af/0x200 [nvidia] mapping multiple BARs
[   11.370964] e1000e 0000:00:19.0 enp0s25: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   11.371102] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
[   12.695893] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 0
[   34.015105] general protection fault, probably for non-canonical address 0x6b6b6b6b00000000: 0000 [#1] PREEMPT SMP NOPTI
[   34.015113] CPU: 4 PID: 1044 Comm: opera Tainted: P          IOE     5.12.13-arch1-2 #1
[   34.015117] Hardware name: System manufacturer System Product Name/Rampage III Extreme, BIOS 1502    10/03/2011
[   34.015119] RIP: 0010:_nv035609rm+0xb0/0xe0 [nvidia]
[   34.015630] Code: 89 c2 48 89 ef 48 8d b1 48 01 00 00 4c 89 e9 e8 46 5b ff ff 66 0f 1f 44 00 00 48 89 ef e8 a8 5b ff ff 84 c0 74 8a 48 8b 75 00 <48> 39 5e 08 75 ea 4c 39 26 75 e5 49 8b 44 24 20 48 8d b8 48 01 00
[   34.015633] RSP: 0018:ffffb1d842c8bbd8 EFLAGS: 00010202
[   34.015636] RAX: 0000000000000001 RBX: ffff9b3b3329f830 RCX: ffff9b3ae3620178
[   34.015638] RDX: 6b6b6b6b6b6b6b6b RSI: 6b6b6b6b00000000 RDI: ffff9b3ae83c5d28
[   34.015641] RBP: ffff9b3ae83c5d28 R08: 0000000000000020 R09: ffff9b3ae83c5d30
[   34.015643] R10: ffff9b3ad0704008 R11: 0000000000000001 R12: ffff9b3ae7517aa8
[   34.015645] R13: 6b6b6b6b00000000 R14: ffff9b3ae83c5da8 R15: ffff9b3b3329f830
[   34.015647] FS:  00007fb5d1e5ccc0(0000) GS:ffff9b3fe7b00000(0000) knlGS:0000000000000000
[   34.015650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.015652] CR2: 00007f5c38f2a910 CR3: 000000016da10000 CR4: 00000000000006e0
[   34.015655] Call Trace:
[   34.015659]  ? _nv013909rm+0x2f4/0x780 [nvidia]
[   34.016186]  ? _nv037555rm+0xb3/0x150 [nvidia]
[   34.016651]  ? _nv037554rm+0x297/0x4e0 [nvidia]
[   34.017219]  ? _nv037549rm+0x60/0x70 [nvidia]
[   34.017686]  ? _nv037550rm+0x7b/0xb0 [nvidia]
[   34.018161]  ? _nv035818rm+0x40/0xe0 [nvidia]
[   34.018562]  ? _nv000689rm+0x67/0xa0 [nvidia]
[   34.018918]  ? rm_cleanup_file_private+0xea/0x140 [nvidia]
[   34.019246]  ? nvidia_close+0x150/0x310 [nvidia]
[   34.019465]  ? nvidia_frontend_close+0x2b/0x50 [nvidia]
[   34.019682]  ? __fput+0x8c/0x230
[   34.019687]  ? task_work_run+0x5c/0x90
[   34.019690]  ? do_exit+0x375/0xae0
[   34.019695]  ? do_group_exit+0x33/0xa0
[   34.019698]  ? __x64_sys_exit_group+0x14/0x20
[   34.019701]  ? do_syscall_64+0x33/0x40
[   34.019706]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   34.019710] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) rfkill iTCO_wdt intel_powerclamp intel_pmc_bxt coretemp iTCO_vendor_support mxm_wmi kvm_intel btrfs kvm snd_hda_codec_hdmi snd_hda_intel irqbypass intel_cstate snd_intel_dspcfg snd_intel_sdw_acpi snd_oxygen blake2b_generic intel_uncore i2c_i801 pcspkr xor snd_hda_codec i2c_smbus drm_kms_helper snd_oxygen_lib raid6_pq snd_mpu401_uart snd_hda_core snd_rawmidi snd_hwdep snd_seq_device mousedev libcrc32c snd_pcm lpc_ich cec snd_timer snd syscopyarea sysfillrect sysimgblt e1000e fb_sys_fops soundcore i7core_edac i5500_temp asus_atk0110 wmi mac_hid acpi_cpufreq drm sg crypto_user fuse agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi crc32c_intel firewire_ohci firewire_core sr_mod xhci_pci usbhid cdrom crc_itu_t pata_jmicron xhci_pci_renesas
[   34.019784] ---[ end trace 2a91fbd79fb61ddc ]---
[   34.019787] RIP: 0010:_nv035609rm+0xb0/0xe0 [nvidia]
[   34.020183] Code: 89 c2 48 89 ef 48 8d b1 48 01 00 00 4c 89 e9 e8 46 5b ff ff 66 0f 1f 44 00 00 48 89 ef e8 a8 5b ff ff 84 c0 74 8a 48 8b 75 00 <48> 39 5e 08 75 ea 4c 39 26 75 e5 49 8b 44 24 20 48 8d b8 48 01 00
[   34.020186] RSP: 0018:ffffb1d842c8bbd8 EFLAGS: 00010202
[   34.020188] RAX: 0000000000000001 RBX: ffff9b3b3329f830 RCX: ffff9b3ae3620178
[   34.020190] RDX: 6b6b6b6b6b6b6b6b RSI: 6b6b6b6b00000000 RDI: ffff9b3ae83c5d28
[   34.020191] RBP: ffff9b3ae83c5d28 R08: 0000000000000020 R09: ffff9b3ae83c5d30
[   34.020193] R10: ffff9b3ad0704008 R11: 0000000000000001 R12: ffff9b3ae7517aa8
[   34.020194] R13: 6b6b6b6b00000000 R14: ffff9b3ae83c5da8 R15: ffff9b3b3329f830
[   34.020196] FS:  00007fb5d1e5ccc0(0000) GS:ffff9b3fe7b00000(0000) knlGS:0000000000000000
[   34.020198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.020200] CR2: 00007f5c38f2a910 CR3: 000000016da10000 CR4: 00000000000006e0
[   34.020202] Fixing recursive fault but reboot is needed!
[   82.989715] BUG: kernel NULL pointer dereference, address: 0000000000000400
[   82.989722] #PF: supervisor read access in kernel mode
[   82.989726] #PF: error_code(0x0000) - not-present page
[   82.989729] PGD 13ef39067 P4D 13ef39067 PUD 0 
[   82.989735] Oops: 0000 [#2] PREEMPT SMP NOPTI
[   82.989739] CPU: 1 PID: 984 Comm: QSGRenderThread Tainted: P      D   IOE     5.12.13-arch1-2 #1
[   82.989744] Hardware name: System manufacturer System Product Name/Rampage III Extreme, BIOS 1502    10/03/2011
[   82.989747] RIP: 0010:_nv009366rm+0x3c/0x340 [nvidia]
[   82.990459] Code: 07 0f 1f 44 00 00 31 d2 48 8b 07 48 85 c0 75 1a e9 a1 02 00 00 66 0f 1f 84 00 00 00 00 00 48 8b 48 10 48 85 c9 74 17 48 89 c8 <48> 39 30 77 ef 0f 83 29 02 00 00 48 8b 48 18 48 85 c9 75 e9 48 89
[   82.990463] RSP: 0018:ffffb1d842a0bdc0 EFLAGS: 00010006
[   82.990467] RAX: 0000000000000400 RBX: ffffb1d842a0bdf8 RCX: 0000000000000400
[   82.990471] RDX: ffffb1d842a0be48 RSI: 00000000000003d8 RDI: ffffffffc2c56958
[   82.990474] RBP: ffff9b3ad8765ff8 R08: ffff9b3ac2d21c80 R09: 0000000000000010
[   82.990477] R10: ffff9b3ac2d21c80 R11: 0000000000000001 R12: 6b6b6b6b6b6b6b6b
[   82.990480] R13: ffff9b3ac2d21c80 R14: ffff9b3ae8a42000 R15: 0000000000000052
[   82.990483] FS:  00007fc18de5d640(0000) GS:ffff9b3fe7a40000(0000) knlGS:0000000000000000
[   82.990487] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.990491] CR2: 0000000000000400 CR3: 000000012bef6000 CR4: 00000000000006e0
[   82.990494] Call Trace:
[   82.990498]  ? _nv039600rm+0xdf/0x1e0 [nvidia]
[   82.990924]  ? rm_ioctl+0x3c/0xb0 [nvidia]
[   82.991541]  ? nvidia_ioctl+0x11c/0x8b0 [nvidia]
[   82.991947]  ? nvidia_ioctl+0x69c/0x8b0 [nvidia]
[   82.992355]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[   82.992758]  ? __x64_sys_ioctl+0x82/0xb0
[   82.992765]  ? do_syscall_64+0x33/0x40
[   82.992772]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   82.992780] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) rfkill iTCO_wdt intel_powerclamp intel_pmc_bxt coretemp iTCO_vendor_support mxm_wmi kvm_intel btrfs kvm snd_hda_codec_hdmi snd_hda_intel irqbypass intel_cstate snd_intel_dspcfg snd_intel_sdw_acpi snd_oxygen blake2b_generic intel_uncore i2c_i801 pcspkr xor snd_hda_codec i2c_smbus drm_kms_helper snd_oxygen_lib raid6_pq snd_mpu401_uart snd_hda_core snd_rawmidi snd_hwdep snd_seq_device mousedev libcrc32c snd_pcm lpc_ich cec snd_timer snd syscopyarea sysfillrect sysimgblt e1000e fb_sys_fops soundcore i7core_edac i5500_temp asus_atk0110 wmi mac_hid acpi_cpufreq drm sg crypto_user fuse agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi crc32c_intel firewire_ohci firewire_core sr_mod xhci_pci usbhid cdrom crc_itu_t pata_jmicron xhci_pci_renesas
[   82.992864] CR2: 0000000000000400
[   82.992867] ---[ end trace 2a91fbd79fb61ddd ]---
[   82.992870] RIP: 0010:_nv035609rm+0xb0/0xe0 [nvidia]
[   82.993546] Code: 89 c2 48 89 ef 48 8d b1 48 01 00 00 4c 89 e9 e8 46 5b ff ff 66 0f 1f 44 00 00 48 89 ef e8 a8 5b ff ff 84 c0 74 8a 48 8b 75 00 <48> 39 5e 08 75 ea 4c 39 26 75 e5 49 8b 44 24 20 48 8d b8 48 01 00
[   82.993550] RSP: 0018:ffffb1d842c8bbd8 EFLAGS: 00010202
[   82.993554] RAX: 0000000000000001 RBX: ffff9b3b3329f830 RCX: ffff9b3ae3620178
[   82.993557] RDX: 6b6b6b6b6b6b6b6b RSI: 6b6b6b6b00000000 RDI: ffff9b3ae83c5d28
[   82.993560] RBP: ffff9b3ae83c5d28 R08: 0000000000000020 R09: ffff9b3ae83c5d30
[   82.993562] R10: ffff9b3ad0704008 R11: 0000000000000001 R12: ffff9b3ae7517aa8
[   82.993565] R13: 6b6b6b6b00000000 R14: ffff9b3ae83c5da8 R15: ffff9b3b3329f830
[   82.993568] FS:  00007fc18de5d640(0000) GS:ffff9b3fe7a40000(0000) knlGS:0000000000000000
[   82.993572] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.993576] CR2: 0000000000000400 CR3: 000000012bef6000 CR4: 00000000000006e0
[   82.993579] note: QSGRenderThread[984] exited with preempt_count 1
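A note for anyone reading the dumps above: 0x6b is the byte the kernel uses to poison freed slab objects (POISON_FREE in include/linux/poison.h), so registers full of 6b bytes suggest that the driver read from memory that had already been freed or was never initialized. The sketch below picks such registers out of a register dump. The function name and the four-byte threshold are my own choices for illustration, not part of any existing tool.

```python
import re

POISON_FREE = 0x6B  # byte written into freed slab objects when poisoning is enabled

# Matches 'RDX: 6b6b6b6b6b6b6b6b' style pairs: a register name and 16 hex digits.
_REGISTER = re.compile(r"\b([A-Z][A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def poisoned_registers(dmesg_text, min_run=4):
    """List (register, value) pairs whose value holds min_run consecutive poison bytes."""
    needle = bytes([POISON_FREE]) * min_run
    hits = []
    for name, value in _REGISTER.findall(dmesg_text):
        raw = bytes.fromhex(value)  # compare on byte boundaries, not hex digits
        if needle in raw and (name, value) not in hits:
            hits.append((name, value))
    return hits


# Two lines copied from the first oops in this report.
sample = (
    "[   34.015638] RDX: 6b6b6b6b6b6b6b6b RSI: 6b6b6b6b00000000 RDI: ffff9b3ae83c5d28\n"
    "[   34.015645] R13: 6b6b6b6b00000000 R14: ffff9b3ae83c5da8 R15: ffff9b3b3329f830\n"
)
for name, value in poisoned_registers(sample):
    print(name, value)
```

Matching on whole bytes rather than on the hex string avoids false hits where a 6b pattern only appears across nibble boundaries.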

Finally, please excuse my sloppy bug reporting skills, and do please inform me if there’s any other information I can provide to help debug the issue.

Thanks to the previous report by @Linden.Knight, I was able to resolve the very same issue, which I had been observing since May 2021. The freezing issue still occurs with the latest 470.57.02 driver, released on 2021-07-19.

I can confirm that removing the hardening option slub_debug=P (slub poisoning) from the kernel command line fixes the freezing issue on my hardened Gentoo Linux set-up. Up to and including the 460-series driver, slub poisoning worked without any issue, though.

Additionally, I was once able to access my machine via ssh while it was in a rarely occurring semi-frozen state (Xorg frozen, but the hardware mouse cursor still movable), and obtained a dmesg log and an incomplete dump from the nvidia-bug-report.sh utility. In most lock-ups, however, the whole kernel freezes and the system cannot be accessed at all.

I also observed that nvidia-bug-report.sh hangs at some point while collecting the logs (for me, after obtaining and writing power/runtime_usage).

Somehow, I managed to obtain a fairly comprehensive, though still incomplete, nvidia-bug-report.log.gz that goes beyond the aforementioned point. You will find the incomplete log attached to this post.

nvidia-bug-report.log.gz (40.2 KB)