My new system is stuck at 100% CPU on one core inside the nvidia module. The driver never finishes loading; the hang seems to happen in its init code. I can't unload the driver or use nvidia-smi or related tools.
This also blocks shutdown.
- driver: 440.44-r1
- kernel: 5.4.6
- GPU: ASUS GeForce® RTX 2080 Ti ROG Strix OC
- Mainboard: Asus ROG STRIX TRX40-E GAMING
The card seems to work under the dual-booted Windows install, though (apart from odd display resets related to power management).
Any additional info is in the nvidia-bug-report: https://s3-eu-west-1.amazonaws.com/paste.particleflux.codes/misc/nvidia-bug-report.log.gz
The dmesg output in the bug report also contains two CPU backtraces I triggered via SysRq; both point to the nvidia module as the cause of the kworker using 100% CPU.
Here’s one of these backtraces:
[ 82.893211] NMI backtrace for cpu 0
[ 82.893212] CPU: 0 PID: 267 Comm: kworker/0:2 Tainted: P O 5.4.6-gentoo #1
[ 82.893212] Hardware name: System manufacturer System Product Name/ROG STRIX TRX40-E GAMING, BIOS 0702 12/12/2019
[ 82.893212] Workqueue: events work_for_cpu_fn
[ 82.893213] RIP: 0010:os_delay+0xfb/0x240 [nvidia]
[ 82.893214] Code: 20 49 f7 e4 65 48 8b 04 25 00 5d 01 00 48 c7 40 10 01 00 00 00 48 89 d7 48 c1 ef 12 eb 40 4c 89 f0 48 89 dd 48 29 f0 48 29 d5 <79> 0b 48 83 e8 01 48 81 c5 40 42 0f 00 48 69 c0 40 42 0f 00 48 01
[ 82.893214] RSP: 0018:ffffb37dc0b6bbf0 EFLAGS: 00000212
[ 82.893214] RAX: 0000000000000dbf RBX: 00000000000be301 RCX: 0000000000000000
[ 82.893214] RDX: 0000000000059ffa RSI: 000000005e0b447e RDI: 000af7a58431a2d0
[ 82.893215] RBP: 0000000000064307 R08: 000000134cc8d115 R09: ffffb37dc0b6bb88
[ 82.893215] R10: ffffa3c74de27880 R11: 0000000000000001 R12: 431bde82d7b634db
[ 82.893215] R13: 20c49ba5e353f7cf R14: 000000005e0b523d R15: ffffa3c746977831
[ 82.893215] FS: 0000000000000000(0000) GS:ffffa3c74de00000(0000) knlGS:0000000000000000
[ 82.893215] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 82.893216] CR2: 00007efed7ec4192 CR3: 0000000b0be0a000 CR4: 0000000000340eb0
[ 82.893216] Call Trace:
[ 82.893216] _nv030277rm+0x2fa/0x360 [nvidia]
[ 82.893216] ? os_pci_read_dword+0xd/0x20 [nvidia]
[ 82.893216] ? _nv031101rm+0x82/0x130 [nvidia]
[ 82.893216] ? _nv000891rm+0x5c/0x1a0 [nvidia]
[ 82.893217] ? rm_get_gpu_uuid+0x44/0x1e0 [nvidia]
[ 82.893217] ? proc_register+0xee/0x160
[ 82.893217] ? nv_control_irq+0x53c/0xed0 [nvidia]
[ 82.893217] ? __switch_to_asm+0x40/0x70
[ 82.893217] ? __switch_to_asm+0x34/0x70
[ 82.893217] ? __switch_to_asm+0x40/0x70
[ 82.893217] ? local_pci_probe+0x3d/0x70
[ 82.893218] ? __schedule+0x28c/0x5a0
[ 82.893218] ? work_for_cpu_fn+0x11/0x20
[ 82.893218] ? process_one_work+0x1db/0x380
[ 82.893218] ? worker_thread+0x1f5/0x3c0
[ 82.893218] ? kthread+0xf6/0x130
[ 82.893218] ? process_one_work+0x380/0x380
[ 82.893218] ? kthread_park+0x80/0x80
[ 82.893219] ? ret_from_fork+0x22/0x40
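For anyone comparing this against the second backtrace in the log, here is a rough Python sketch that pulls out the RIP symbol and the call-trace frames, flagging the ones the kernel marked with "?" (unreliable). The function name `summarize_backtrace`, the regexes, and the `SAMPLE` excerpt are my own assumptions for illustration, not part of any NVIDIA or kernel tooling, and the patterns only cover the line shapes seen in the trace above.

```python
import re

# Assumed line shape: "RIP: 0010:os_delay+0xfb/0x240 [nvidia]"
RIP_RE = re.compile(
    r"RIP:\s+\w+:(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)
# Assumed frame shape: "] ? symbol+0xOFF/0xLEN [module]"; the "?" is optional
FRAME_RE = re.compile(
    r"\]\s+(?P<q>\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)

def summarize_backtrace(text):
    """Return ((rip_symbol, rip_module), [(symbol, module, reliable), ...])."""
    rip = None
    frames = []
    for line in text.splitlines():
        m = RIP_RE.search(line)
        if m:
            rip = (m.group("sym"), m.group("mod"))
            continue
        m = FRAME_RE.search(line)
        if m:
            # A frame is "reliable" when the kernel did not prefix it with "?"
            frames.append((m.group("sym"), m.group("mod"), m.group("q") is None))
    return rip, frames

# Excerpt copied from the backtrace in this post
SAMPLE = """\
[ 82.893213] RIP: 0010:os_delay+0xfb/0x240 [nvidia]
[ 82.893216] Call Trace:
[ 82.893216] _nv030277rm+0x2fa/0x360 [nvidia]
[ 82.893216] ? os_pci_read_dword+0xd/0x20 [nvidia]
[ 82.893217] ? proc_register+0xee/0x160
[ 82.893217] ? local_pci_probe+0x3d/0x70
"""

if __name__ == "__main__":
    rip, frames = summarize_backtrace(SAMPLE)
    print("executing in:", rip)
    for sym, mod, reliable in frames:
        print(("  " if reliable else "? ") + sym, f"[{mod}]" if mod else "")
```

In the excerpt, the instruction pointer is in `os_delay` and the only frame not marked "?" is `_nv030277rm`, both in the nvidia module. To my eye that looks like a delay loop being called during PCI probe, but that is my reading of the trace, not something I have confirmed.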
Any idea on which direction to troubleshoot?