384.11 driver crashes kernel

EDIT: It’s actually the 387.34 driver, but the previous driver had the same problem.

$ inxi -bxx
System:    Host: ronin Kernel: 4.14.0-12.1-liquorix-amd64 x86_64 bits: 64 gcc: 7.2.0
           Desktop: Xfce 4.12.4 (Gtk 2.24.31) dm: lightdm Distro: Debian GNU/Linux buster/sid
Machine:   Device: desktop Mobo: ASUSTeK model: ROG STRIX B350-F GAMING v: Rev X.0x serial: N/A
           UEFI [Legacy]: American Megatrends v: 3401 date: 12/04/2017
CPU:       Quad core AMD Ryzen 5 1500X (-HT-MCP-) arch: Zen rev.1 speed/max: 1550/3500 MHz
Graphics:  Card: NVIDIA GK208 [GeForce GT 730] bus-ID: 08:00.0 chip-ID: 10de:1287
           Display Server: x11 (X.Org 1.19.5 ) driver: nouveau Resolution: 1280x1024@60.02hz
           OpenGL: renderer: NV106 version: 4.3 Mesa 17.3.1 (compat-v: 3.0) Direct Render: Yes
Network:   Card: Intel I211 Gigabit Network Connection
           driver: igb v: 5.4.0-k port: e000 bus-ID: 03:00.0 chip-ID: 8086:1539
Drives:    HDD Total Size: 370.1GB (53.9% used)
Info:      Processes: 284 Uptime: 9:38 Memory: 4289.6/7976.9MB Init: systemd v: 236 runlevel: 5 Gcc sys: 7.2.0
           Client: Shell (bash 4.4.121 running in xfce4-terminal) inxi: 2.3.45

This problem made my computer unbootable. Because DKMS builds the driver for every installed kernel, none of my kernels would boot properly, so I reinstalled Debian. I then switched to sgfxi to install the driver without DKMS, so that I can boot another kernel and remove the driver if it breaks again.

Oddly enough, Debian doesn’t come with persistent logs turned on, but I managed to enable them, and this is the report I get:

Jan 08 22:47:03 ronin kernel: divide error: 0000 [#1] PREEMPT SMP 
Jan 08 22:47:03 ronin kernel: Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq edac_mce_amd kvm_amd kvm eeepc_wmi asus_wmi sparse_keymap irqbypass rfkill
Jan 08 22:47:03 ronin kernel:  crc32c_intel i2c_piix4 libata i2c_algo_bit dca ptp xhci_pci pps_core scsi_mod xhci_hcd rtc_cmos gpio_amdpt gpio_generic i2c_designware_platform i2c_d 
Jan 08 22:47:03 ronin kernel: CPU: 2 PID: 347 Comm: systemd-udevd Not tainted 4.14.0-11.1-liquorix-amd64 #1 liquorix 4.14-14 
Jan 08 22:47:03 ronin kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 3401 12/04/2017 
Jan 08 22:47:03 ronin kernel: task: ffff88020eb73500 task.stack: ffffc900014c8000 
Jan 08 22:47:03 ronin kernel: RIP: 0010:nvGetClocks+0x176/0x260 [nvidiafb] 
Jan 08 22:47:03 ronin kernel: RSP: 0018:ffffc900014cb7f8 EFLAGS: 00010246 
Jan 08 22:47:03 ronin kernel: RAX: 0000000000000000 RBX: ffff8802152e2420 RCX: 0000000000000000 
Jan 08 22:47:03 ronin kernel: RDX: 0000000000000000 RSI: ffffc900014cb834 RDI: ffff8802152e2420 
Jan 08 22:47:03 ronin kernel: RBP: ffff8802152e2518 R08: ffffc900014cb838 R09: 0000000000000000 
Jan 08 22:47:03 ronin kernel: R10: 0000000000000068 R11: 00000000002e18c8 R12: 0000000000062570 
Jan 08 22:47:03 ronin kernel: R13: 000000000000000e R14: 0000000000000010 R15: 0000000000000008 
Jan 08 22:47:03 ronin kernel: FS:  00007fe8ed56b400(0000) GS:ffff88021ec80000(0000) knlGS:0000000000000000 
Jan 08 22:47:03 ronin kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
Jan 08 22:47:03 ronin kernel: CR2: 00007f546a15c870 CR3: 000000020df0a000 CR4: 00000000003406e0 
Jan 08 22:47:03 ronin kernel: Call Trace: 
Jan 08 22:47:03 ronin kernel:  NVCalcStateExt+0x189/0x8e0 [nvidiafb] 
Jan 08 22:47:03 ronin kernel:  nvidiafb_set_par+0x47c/0x9f0 [nvidiafb] 
Jan 08 22:47:03 ronin kernel:  fbcon_init+0x59e/0x780 
Jan 08 22:47:03 ronin kernel:  visual_init+0xca/0x120 
Jan 08 22:47:03 ronin kernel:  do_bind_con_driver+0x2ab/0x640 
Jan 08 22:47:03 ronin kernel:  do_take_over_console+0x22d/0x470 
Jan 08 22:47:03 ronin kernel:  fbcon_event_notify+0x90d/0xa20 
Jan 08 22:47:03 ronin kernel:  blocking_notifier_call_chain+0x5d/0x80 
Jan 08 22:47:03 ronin kernel:  register_framebuffer+0x1d5/0x2f0 
Jan 08 22:47:03 ronin kernel:  nvidiafb_probe+0x6b2/0xa80 [nvidiafb] 
Jan 08 22:47:03 ronin kernel:  pci_device_probe+0x1e4/0x340 
Jan 08 22:47:03 ronin kernel:  driver_probe_device+0x3d4/0x4a0 
Jan 08 22:47:03 ronin kernel:  __driver_attach+0xd1/0xe0 
Jan 08 22:47:03 ronin kernel:  ? driver_probe_device+0x4a0/0x4a0 
Jan 08 22:47:03 ronin kernel:  bus_for_each_dev+0x57/0x80 
Jan 08 22:47:03 ronin kernel:  bus_add_driver+0x191/0x210 
Jan 08 22:47:03 ronin kernel:  driver_register+0x78/0xf0 
Jan 08 22:47:03 ronin kernel:  ? nvidiafb_setcolreg+0x2a0/0x2a0 [nvidiafb] 
Jan 08 22:47:03 ronin kernel:  do_one_initcall+0x46/0x190 
Jan 08 22:47:03 ronin kernel:  do_init_module+0x58/0x2f9 
Jan 08 22:47:03 ronin kernel:  load_module+0x1dfd/0x2760 
Jan 08 22:47:03 ronin kernel:  ? SyS_finit_module+0x91/0xb0 
Jan 08 22:47:03 ronin kernel:  SyS_finit_module+0x91/0xb0 
Jan 08 22:47:03 ronin kernel:  do_syscall_64+0x64/0x190 
Jan 08 22:47:03 ronin kernel:  entry_SYSCALL64_slow_path+0x25/0x25 
Jan 08 22:47:03 ronin kernel: RIP: 0033:0x7fe8ece94da9 
Jan 08 22:47:03 ronin kernel: RSP: 002b:00007ffe2837a368 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 
Jan 08 22:47:03 ronin kernel: RAX: ffffffffffffffda RBX: 000055d1362632f0 RCX: 00007fe8ece94da9 
Jan 08 22:47:03 ronin kernel: RDX: 0000000000000000 RSI: 00007fe8ecb9f2d5 RDI: 0000000000000010 
Jan 08 22:47:03 ronin kernel: RBP: 00007fe8ecb9f2d5 R08: 0000000000000000 R09: 0000000000000000 
Jan 08 22:47:03 ronin kernel: R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000 
Jan 08 22:47:03 ronin kernel: R13: 000055d1362591a0 R14: 0000000000020000 R15: 000055d136249140 
Jan 08 22:47:03 ronin kernel: Code: f0 0f 00 00 3d 00 03 00 00 74 73 3d 30 03 00 00 74 6c 41 8b 89 04 05 00 00 0f b6 c5 44 0f b6 c9 c1 e9 10 0f af c2 31 d2 83 e1 0f <41> f7 f1 d3 e 
Jan 08 22:47:03 ronin kernel: RIP: nvGetClocks+0x176/0x260 [nvidiafb] RSP: ffffc900014cb7f8
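
For anyone who needs a log like the one above: the following is only a sketch of one common way to get persistent journald logs on Debian, not necessarily the method used here. It assumes journald is running with its default Storage=auto setting, in which case it starts writing to disk once /var/log/journal exists. The function name and the optional directory argument are made up for illustration; the argument is there only so the snippet can be tried without root.

```shell
#!/bin/sh
# Sketch: create the directory that switches journald (Storage=auto) to
# persistent storage. enable_persistent_journal is a hypothetical helper name.
enable_persistent_journal() {
    dir="${1:-/var/log/journal}"
    mkdir -p "$dir" || return 1
    echo "journal directory ready: $dir"
}

# As root on the real system:
#   enable_persistent_journal
#   systemd-tmpfiles --create --prefix /var/log/journal   # fix ownership and ACLs
#   systemctl restart systemd-journald
# After the next crash and reboot, kernel messages from the previous boot:
#   journalctl -k -b -1
```

The last command is what yields a trace like the one quoted above, since -k restricts the output to kernel messages and -b -1 selects the previous boot.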

I tried again today. The current Debian kernel is fine but the Liquorix kernel crashes. This time the log contains one more line.

$ inxi -bxx
System:    Host: ronin Kernel: 4.14.0-3-amd64 x86_64 bits: 64 gcc: 7.2.0
           Desktop: Xfce 4.12.4 (Gtk 2.24.31) dm: lightdm Distro: Debian GNU/Linux buster/sid
Machine:   Device: desktop Mobo: ASUSTeK model: ROG STRIX B350-F GAMING v: Rev X.0x serial: N/A
           UEFI [Legacy]: American Megatrends v: 3401 date: 12/04/2017
CPU:       Quad core AMD Ryzen 5 1500X (-HT-MCP-) arch: Zen rev.1 speed/max: 1550/3500 MHz
Graphics:  Card: NVIDIA GK208 [GeForce GT 730] bus-ID: 08:00.0 chip-ID: 10de:1287
           Display Server: x11 (X.Org 1.19.5 ) driver: nvidia Resolution: 1280x960@60.00hz
           OpenGL: renderer: GeForce GT 730/PCIe/SSE2
           version: 4.5.0 NVIDIA 387.34 (compat-v: 4.6.0) Direct Render: Yes
Network:   Card: Intel I211 Gigabit Network Connection
           driver: igb v: 5.4.0-k port: e000 bus-ID: 03:00.0 chip-ID: 8086:1539
Drives:    HDD Total Size: 370.1GB (37.4% used)
Info:      Processes: 200 Uptime: 6 min Memory: 903.5/7978.8MB Init: systemd v: 236 runlevel: 5 Gcc sys: 7.2.0
           Client: Shell (bash 4.4.121 running in xfce4-terminal) inxi: 2.3.45

Jan 11 15:06:29 ronin kernel: divide error: 0000 [#1] PREEMPT SMP
Jan 11 15:06:29 ronin kernel: Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq edac_mce_amd kvm_amd kvm irqbypass eeepc_wmi asus_wmi sparse_keymap crct10d
Jan 11 15:06:29 ronin kernel:  igb i2c_algo_bit dca crc32c_intel xhci_pci ptp i2c_piix4 pps_core scsi_mod xhci_hcd rtc_cmos gpio_amdpt gpio_generic i2c_designware_platform i2c_desi
Jan 11 15:06:29 ronin kernel: CPU: 6 PID: 352 Comm: systemd-udevd Not tainted 4.14.0-11.1-liquorix-amd64 #1 liquorix 4.14-14
Jan 11 15:06:29 ronin kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 3401 12/04/2017
Jan 11 15:06:29 ronin kernel: task: ffff88021558ea00 task.stack: ffffc90001564000
Jan 11 15:06:29 ronin kernel: RIP: 0010:nvGetClocks+0x176/0x260 [nvidiafb]
Jan 11 15:06:29 ronin kernel: RSP: 0018:ffffc900015677f8 EFLAGS: 00010246
Jan 11 15:06:29 ronin kernel: RAX: 0000000000000000 RBX: ffff8802148de420 RCX: 0000000000000000
Jan 11 15:06:29 ronin kernel: RDX: 0000000000000000 RSI: ffffc90001567834 RDI: ffff8802148de420
Jan 11 15:06:29 ronin kernel: RBP: ffff8802148de518 R08: ffffc90001567838 R09: 0000000000000000
Jan 11 15:06:29 ronin kernel: R10: 0000000000000068 R11: 00000000002e18c8 R12: 0000000000062570
Jan 11 15:06:29 ronin kernel: R13: 000000000000000e R14: 0000000000000010 R15: 0000000000000008
Jan 11 15:06:29 ronin kernel: FS:  00007f2b66850400(0000) GS:ffff88021ed80000(0000) knlGS:0000000000000000
Jan 11 15:06:29 ronin kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 11 15:06:29 ronin kernel: CR2: 00007ffe18cd42c8 CR3: 000000020df79000 CR4: 00000000003406e0
Jan 11 15:06:29 ronin kernel: Call Trace:
Jan 11 15:06:29 ronin kernel:  NVCalcStateExt+0x189/0x8e0 [nvidiafb]
Jan 11 15:06:29 ronin kernel:  nvidiafb_set_par+0x47c/0x9f0 [nvidiafb]
Jan 11 15:06:29 ronin kernel:  fbcon_init+0x59e/0x780
Jan 11 15:06:29 ronin kernel:  visual_init+0xca/0x120
Jan 11 15:06:29 ronin kernel:  do_bind_con_driver+0x2ab/0x640
Jan 11 15:06:29 ronin kernel:  do_take_over_console+0x22d/0x470
Jan 11 15:06:29 ronin kernel:  fbcon_event_notify+0x90d/0xa20
Jan 11 15:06:29 ronin kernel:  blocking_notifier_call_chain+0x5d/0x80
Jan 11 15:06:29 ronin kernel:  register_framebuffer+0x1d5/0x2f0
Jan 11 15:06:29 ronin kernel:  nvidiafb_probe+0x6b2/0xa80 [nvidiafb]
Jan 11 15:06:29 ronin kernel:  pci_device_probe+0x1e4/0x340
Jan 11 15:06:29 ronin kernel:  driver_probe_device+0x3d4/0x4a0
Jan 11 15:06:29 ronin kernel:  __driver_attach+0xd1/0xe0
Jan 11 15:06:29 ronin kernel:  ? driver_probe_device+0x4a0/0x4a0
Jan 11 15:06:29 ronin kernel:  bus_for_each_dev+0x57/0x80
Jan 11 15:06:29 ronin kernel:  bus_add_driver+0x191/0x210
Jan 11 15:06:29 ronin kernel:  driver_register+0x78/0xf0
Jan 11 15:06:29 ronin kernel:  ? nvidiafb_setcolreg+0x2a0/0x2a0 [nvidiafb]
Jan 11 15:06:29 ronin kernel:  do_one_initcall+0x46/0x190
Jan 11 15:06:29 ronin kernel:  do_init_module+0x58/0x2f9
Jan 11 15:06:29 ronin kernel:  load_module+0x1dfd/0x2760
Jan 11 15:06:29 ronin kernel:  ? SyS_finit_module+0x91/0xb0
Jan 11 15:06:29 ronin kernel:  SyS_finit_module+0x91/0xb0
Jan 11 15:06:29 ronin kernel:  do_syscall_64+0x64/0x190
Jan 11 15:06:29 ronin kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Jan 11 15:06:29 ronin kernel: RIP: 0033:0x7f2b6617cda9
Jan 11 15:06:29 ronin kernel: RSP: 002b:00007ffc10acd648 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Jan 11 15:06:29 ronin kernel: RAX: ffffffffffffffda RBX: 0000561bfe50b2f0 RCX: 00007f2b6617cda9
Jan 11 15:06:29 ronin kernel: RDX: 0000000000000000 RSI: 00007f2b65e872d5 RDI: 0000000000000010
Jan 11 15:06:29 ronin kernel: RBP: 00007f2b65e872d5 R08: 0000000000000000 R09: 0000000000000000
Jan 11 15:06:29 ronin kernel: R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
Jan 11 15:06:29 ronin kernel: R13: 0000561bfe4f9000 R14: 0000000000020000 R15: 0000561bfe531660
Jan 11 15:06:29 ronin kernel: Code: f0 0f 00 00 3d 00 03 00 00 74 73 3d 30 03 00 00 74 6c 41 8b 89 04 05 00 00 0f b6 c5 44 0f b6 c9 c1 e9 10 0f af c2 31 d2 83 e1 0f <41> f7 f1 d3 e
Jan 11 15:06:29 ronin kernel: RIP: nvGetClocks+0x176/0x260 [nvidiafb] RSP: ffffc900015677f8
Jan 11 15:06:29 ronin kernel: ---[ end trace bd775c73033f12fb ]---
Jan 11 15:06:29 ronin systemd-udevd[313]: worker [352] failed while handling '/devices/pci0000:00/0000:00:03.1/0000:08:00.0'

I ran the bug report script twice, before and after the crash.
nvidia-bug-report-1.log.gz (130 KB)

You attached the same report twice; both are from 15:01:47, but the crash occurred at 15:06:29.
General hints: did you check if you’re affected by the Ryzen bug?
https://github.com/suaefar/ryzen-test
Did you check if the gpu is working in another system? Did an earlier driver version work?

I went through RMA for the segfault bug that script tests for, and I don’t get those segfaults anymore, although I do still get the occasional random hard crash that AMD seems to be incapable of fixing. I’ve tried various BIOS settings, and those crashes don’t leave anything of note in the system log. But the driver crash is definitely not related to them.

The GPU works. The driver also works if I start X right after installing it, without rebooting. Something happens during bootup that crashes the driver. One possibility I can think of is that my CRT monitor has something to do with it: in the bootup log I get “nvidiafb: unable to detect display type”. Judging by the backtraces, the crash does look related to nvidiafb.

It looks like the bug report script will not overwrite a bug report if one already exists, so I ended up copying the same file again, thinking it had been changed.

Ok, you just found the solution that I didn’t see: nvidiafb is not part of the official nvidia driver. It comes with the kernel and has to be blacklisted, just like nouveau.
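
A minimal sketch of that blacklist, assuming Debian’s usual /etc/modprobe.d layout and initramfs-tools. The function name and the file name blacklist-nvidiafb.conf are arbitrary choices for illustration, and the optional directory argument exists only so the snippet can be tried without root.

```shell
#!/bin/sh
# Sketch: write a modprobe blacklist entry for the in-kernel nvidiafb module.
# blacklist_nvidiafb and the file name are hypothetical; any *.conf name works.
blacklist_nvidiafb() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'blacklist nvidiafb\n' > "$conf_dir/blacklist-nvidiafb.conf"
}

# As root on the real system:
#   blacklist_nvidiafb
#   update-initramfs -u -k all   # rebuild the initramfs for every installed kernel
#   reboot
# After rebooting, this should print nothing:
#   lsmod | grep nvidiafb
```

The initramfs is rebuilt on the assumption that it carries its own copy of the modprobe configuration, so a blacklist written only to the root filesystem may not yet apply when udev probes the card early in boot.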