Driver stuck loading at 100% CPU

My new system gets stuck at 100% CPU on one core inside the nvidia kernel module. The driver never finishes loading; the hang seems to happen in its init code. I can’t unload the module or use nvidia-smi and related tools.
The hang also blocks shutdown.

  • driver: 440.44-r1
  • kernel: 5.4.6
  • GPU: ASUS GeForce® RTX 2080 Ti ROG Strix OC
  • Mainboard: Asus ROG STRIX TRX40-E GAMING

The card seems to work under the dual-booted Windows install, though (apart from odd display resets related to power management).

Any additional info is in the nvidia-bug-report: https://s3-eu-west-1.amazonaws.com/paste.particleflux.codes/misc/nvidia-bug-report.log.gz

The dmesg output in the bug report also contains two CPU backtraces I triggered via sysrq; they point to the nvidia module as the cause of the kworker thread using 100% CPU.

Here’s one of these backtraces:

[   82.893211] NMI backtrace for cpu 0
[   82.893212] CPU: 0 PID: 267 Comm: kworker/0:2 Tainted: P           O      5.4.6-gentoo #1
[   82.893212] Hardware name: System manufacturer System Product Name/ROG STRIX TRX40-E GAMING, BIOS 0702 12/12/2019
[   82.893212] Workqueue: events work_for_cpu_fn
[   82.893213] RIP: 0010:os_delay+0xfb/0x240 [nvidia]
[   82.893214] Code: 20 49 f7 e4 65 48 8b 04 25 00 5d 01 00 48 c7 40 10 01 00 00 00 48 89 d7 48 c1 ef 12 eb 40 4c 89 f0 48 89 dd 48 29 f0 48 29 d5 <79> 0b 48 83 e8 01 48 81 c5 40 42 0f 00 48 69 c0 40 42 0f 00 48 01
[   82.893214] RSP: 0018:ffffb37dc0b6bbf0 EFLAGS: 00000212
[   82.893214] RAX: 0000000000000dbf RBX: 00000000000be301 RCX: 0000000000000000
[   82.893214] RDX: 0000000000059ffa RSI: 000000005e0b447e RDI: 000af7a58431a2d0
[   82.893215] RBP: 0000000000064307 R08: 000000134cc8d115 R09: ffffb37dc0b6bb88
[   82.893215] R10: ffffa3c74de27880 R11: 0000000000000001 R12: 431bde82d7b634db
[   82.893215] R13: 20c49ba5e353f7cf R14: 000000005e0b523d R15: ffffa3c746977831
[   82.893215] FS:  0000000000000000(0000) GS:ffffa3c74de00000(0000) knlGS:0000000000000000
[   82.893215] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.893216] CR2: 00007efed7ec4192 CR3: 0000000b0be0a000 CR4: 0000000000340eb0
[   82.893216] Call Trace:
[   82.893216]  _nv030277rm+0x2fa/0x360 [nvidia]
[   82.893216]  ? os_pci_read_dword+0xd/0x20 [nvidia]
[   82.893216]  ? _nv031101rm+0x82/0x130 [nvidia]
[   82.893216]  ? _nv000891rm+0x5c/0x1a0 [nvidia]
[   82.893217]  ? rm_get_gpu_uuid+0x44/0x1e0 [nvidia]
[   82.893217]  ? proc_register+0xee/0x160
[   82.893217]  ? nv_control_irq+0x53c/0xed0 [nvidia]
[   82.893217]  ? __switch_to_asm+0x40/0x70
[   82.893217]  ? __switch_to_asm+0x34/0x70
[   82.893217]  ? __switch_to_asm+0x40/0x70
[   82.893217]  ? local_pci_probe+0x3d/0x70
[   82.893218]  ? __schedule+0x28c/0x5a0
[   82.893218]  ? work_for_cpu_fn+0x11/0x20
[   82.893218]  ? process_one_work+0x1db/0x380
[   82.893218]  ? worker_thread+0x1f5/0x3c0
[   82.893218]  ? kthread+0xf6/0x130
[   82.893218]  ? process_one_work+0x380/0x380
[   82.893218]  ? kthread_park+0x80/0x80
[   82.893219]  ? ret_from_fork+0x22/0x40
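
To make it easier to see which frames belong to the nvidia module, here is a minimal Python sketch that tallies call-trace frames per module. The names `FRAME_RE`, `SAMPLE` and `summarize_trace` are made up for illustration, and the regex only assumes the frame layout visible in the trace above (timestamp, optional "?", symbol+offset/size, optional [module]); other dmesg formats may need adjusting.

```python
import re
from collections import Counter

# Matches one call-trace frame, e.g.
#   "[   82.893216]  ? os_pci_read_dword+0xd/0x20 [nvidia]"
# Frames without a trailing "[module]" are counted as core kernel code.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)

# A few frames copied from the backtrace above.
SAMPLE = """\
[   82.893216]  _nv030277rm+0x2fa/0x360 [nvidia]
[   82.893216]  ? os_pci_read_dword+0xd/0x20 [nvidia]
[   82.893217]  ? proc_register+0xee/0x160
[   82.893217]  ? local_pci_probe+0x3d/0x70
"""


def summarize_trace(text):
    """Return (frames, counts): frames is a list of
    (symbol, module, reliable) tuples, counts a per-module Counter.
    'reliable' is False for frames the kernel marked with '?'."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            module = m.group("module") or "kernel"
            reliable = m.group("guess") is None
            frames.append((m.group("symbol"), module, reliable))
    counts = Counter(module for _, module, _ in frames)
    return frames, counts


frames, counts = summarize_trace(SAMPLE)
for symbol, module, reliable in frames:
    print(f"{module:8} {symbol} {'' if reliable else '(unreliable)'}")
print(dict(counts))
```

Feeding it the full dmesg from the bug report instead of `SAMPLE` should work the same way, as long as the lines keep this layout.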

Any ideas on which direction to take the troubleshooting?

Still not solved. Here is a sysrq “blocked-tasks” backtrace:

https://s3.eu-west-1.amazonaws.com/paste.particleflux.codes/img/nvidia-blocked-small.jpg
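
For anyone who wants to reproduce these dumps without the keyboard combination, here is a rough Python sketch. It assumes root privileges and a kernel with magic SysRq support, so that /proc/sysrq-trigger exists. The helper `trigger_sysrq` and the action labels are my own names for illustration; the keys themselves are the standard ones ("l" for a backtrace of all active CPUs, "w" for blocked tasks).

```python
# Map of readable action names (hypothetical labels) to SysRq command keys.
SYSRQ_KEYS = {
    "cpu-backtrace": "l",  # backtrace of all active CPUs
    "blocked-tasks": "w",  # tasks in uninterruptible (blocked) state
}


def trigger_sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for 'action' to trigger_path and return the key.

    trigger_path is a parameter only so the function can be tried out
    against an ordinary file; on a real system the default is used.
    """
    key = SYSRQ_KEYS[action]
    with open(trigger_path, "w") as f:
        f.write(key)
    return key
```

Calling `trigger_sysrq("cpu-backtrace")` or `trigger_sysrq("blocked-tasks")` as root should make the kernel log the corresponding dump, which can then be read with dmesg.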

More things I have tried:

  • Updated the kernel to 5.4.12
  • Updated the BIOS
  • Moved the card to a different PCI slot
  • Unplugged the monitors and plugged in an HDMI monitor