Random freezes on RTX 4070 due to kworker ACPI interrupts on Ubuntu-based kernel 6.8

Hi! I have an ASUS TUF Gaming A15 (FA507UI) running Pop!_OS with its Ubuntu-based Linux kernel (6.8.0-76060800daily20240311-generic).

I’m running the system in hybrid GPU mode, which means the discrete GPU should be in standby most of the time.
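For reference, this is roughly how one can confirm the dGPU is actually in runtime suspend (a sketch; the PCI address 0000:01:00.0 is just an example, check lspci | grep -i nvidia for the real one):

cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status   # should print "suspended" while the dGPU is idle
cat /proc/driver/nvidia/gpus/0000:01:00.0/power              # NVIDIA driver's runtime D3 report on recent drivers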

I experience random micro-freezes accompanied by 100% CPU spikes. Usually a single core is responsible, but the affected core changes from time to time. I watched the spikes in the performance monitor GUI and recorded the timestamps together with the CPU number, for reasons I explain below.

First of all, I’m attaching the output of nvidia-bug-report.sh. Interestingly, the freezes were also occurring while the utility was running, so this may help.

I used atop to detect which process causes the spikes; it turned out to be kworker/*:*-kac..., which is why I later filter the processes causing the spikes by kworker.
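A lighter-weight spot check than atop also works for catching the offender (a sketch; psr is the CPU the task last ran on):

ps -eo pid,psr,pcpu,comm --sort=-pcpu | head -n 5   # during a spike the kworker/*:*-kac... thread tops the list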

I used while true; do sleep 0.3; kill -USR1 9101; done to trigger more frequent snapshots from atop (9101 being atop's PID). This resulted in a ~400 MB file, which I can attach if needed.
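For reference, the same loop without the hardcoded PID (a sketch; it assumes a single atop instance is running, and SIGUSR1 simply tells atop to take an extra sample):

ATOP_PID=$(pgrep -x atop | head -n1)                      # resolve atop's PID instead of hardcoding 9101
while true; do sleep 0.3; kill -USR1 "$ATOP_PID"; done    # force an extra atop snapshot every 0.3 s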

I also used /proc/sysrq-trigger to record backtraces at the moment the problem happens. The exact one-liner is:

while true; do sleep 0.1; if [[ $(top -bn1 -o '%CPU' | tail -n+8 | head -n1  | awk '$9 ~ /100/ && $12 ~ /kworker/ {print $9,$12}' | wc -l) = 1 ]]; then echo l > /proc/sysrq-trigger; echo "trigger `date`"; fi; done
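The same logic spread over several lines, in case it is easier to read (a sketch; $9 is %CPU and $12 is COMMAND in top's default batch layout, and writing to /proc/sysrq-trigger requires root):

while true; do
    sleep 0.1
    # take the single busiest task from one batch-mode top iteration
    busiest=$(top -bn1 -o '%CPU' | tail -n+8 | head -n1)
    # if it is a kworker pegging a core at ~100%, dump all-CPU backtraces via sysrq
    if echo "$busiest" | awk '$9 ~ /100/ && $12 ~ /kworker/ {found=1} END {exit !found}'; then
        echo l > /proc/sysrq-trigger
        echo "trigger $(date)"
    fi
done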

Since I knew the timestamps and CPU numbers, I could extract the backtraces from dmesg corresponding to the particular core. See the file attached.
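For completeness, this is roughly how the relevant blocks can be pulled out of the kernel log once you know which core spiked (a sketch; "cpu 1" and the 60-line context are examples):

sudo dmesg -T | grep -n "NMI backtrace for cpu 1"       # locate the dumps for the core that spiked
sudo dmesg -T | grep -A 60 "NMI backtrace for cpu 1"    # ~60 lines of context covers one full trace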

I would really appreciate any advice on how to fix these micro-freezes while still keeping the NVIDIA GPU available.

Thank you!

nvidia-bug-report.log.gz (741.1 KB)
backtrace.txt (15.5 KB)

I’d like to share an update. I also tried running the GPU in Compute mode, so it should simply stay in standby all the time since I’m not using it. It is still recognized by the system: nvidia-smi shows it without errors.
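For context, switching modes on Pop!_OS can be done with system76-power (assuming that is the tool in use here; a reboot is needed after switching):

system76-power graphics                 # print the current graphics mode
sudo system76-power graphics compute    # dGPU stays available for CUDA but drives no display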

The micro-freezes are not gone; it is still the same kworker process. See an example backtrace below:

[ 1099.879043] CPU: 1 PID: 8780 Comm: kworker/1:4 Tainted: P           OE      6.8.0-76060800daily20240311-generic #202403110203~1714077665~22.04~4c8e9a0
[ 1099.879045] Hardware name: ASUSTeK COMPUTER INC. ASUS TUF Gaming A15 FA507UI_FA507UI/FA507UI, BIOS FA507UI.309 03/06/2024
[ 1099.879046] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 1099.879049] RIP: 0010:_nv041366rm+0x3b/0x80 [nvidia]
[ 1099.879222] Code: d3 89 de 48 8d 55 0f c6 45 0f 00 e8 9f ae 59 ff 80 7d 0f 00 41 89 c4 75 11 41 39 5d 10 76 20 49 8b 45 00 c1 eb 02 44 8b 24 98 <5b> 44 89 e0 41 5c 41 5d 48 83 c5 10 c3 0f 1f 84 00 00 00 00 00 be
[ 1099.879223] RSP: 0018:ffffac8d8b37f810 EFLAGS: 00000212
[ 1099.879224] RAX: ffffac8d88000000 RBX: 0000000000210040 RCX: 0000000000000000
[ 1099.879225] RDX: ffff92a6ac905e0f RSI: 0000000000840100 RDI: ffff92a548820008
[ 1099.879226] RBP: ffff92a6ac905e00 R08: 0000000000000000 R09: 0000000000000000
[ 1099.879226] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1099.879227] R13: ffff92a548820be8 R14: 0000000000000000 R15: 0000000000000000
[ 1099.879227] FS:  0000000000000000(0000) GS:ffff92a86e280000(0000) knlGS:0000000000000000
[ 1099.879228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1099.879229] CR2: 00007a14c1c1ca70 CR3: 00000001a0eec000 CR4: 0000000000f50ef0
[ 1099.879230] PKRU: 55555554
[ 1099.879230] Call Trace:
[ 1099.879231]  <NMI>
[ 1099.879232]  ? show_regs+0x6d/0x80
[ 1099.879235]  ? nmi_cpu_backtrace+0xb5/0x120
[ 1099.879237]  ? sched_clock_noinstr+0x9/0x10
[ 1099.879239]  ? nmi_cpu_backtrace_handler+0x11/0x20
[ 1099.879240]  ? nmi_handle+0x64/0x180
[ 1099.879242]  ? default_do_nmi+0x47/0x130
[ 1099.879243]  ? exc_nmi+0x1c2/0x290
[ 1099.879244]  ? end_repeat_nmi+0xf/0x60
[ 1099.879247]  ? _nv041366rm+0x3b/0x80 [nvidia]
[ 1099.879397]  ? _nv041366rm+0x3b/0x80 [nvidia]
[ 1099.879542]  ? _nv041366rm+0x3b/0x80 [nvidia]
[ 1099.879683]  </NMI>
[ 1099.879684]  <TASK>
[ 1099.879685]  ? _nv014101rm+0x10f/0x170 [nvidia]
[ 1099.879919]  ? _nv024750rm+0x6d/0xa0 [nvidia]
[ 1099.880152]  ? _nv038433rm+0x99/0x1a0 [nvidia]
[ 1099.880381]  ? _nv037768rm+0x204/0x270 [nvidia]
[ 1099.880576]  ? _nv037766rm+0xd2/0x380 [nvidia]
[ 1099.880769]  ? _nv038349rm+0x145/0x440 [nvidia]
[ 1099.880958]  ? _nv026652rm+0x2b7/0x970 [nvidia]
[ 1099.881184]  ? _nv026653rm+0x15e/0x390 [nvidia]
[ 1099.881408]  ? _nv026511rm+0x152/0x3b0 [nvidia]
[ 1099.881624]  ? _nv026611rm+0x54/0x130 [nvidia]
[ 1099.881838]  ? _nv000747rm+0x1fc/0x220 [nvidia]
[ 1099.881992]  ? _nv000698rm+0x1a3/0x300 [nvidia]
[ 1099.882140]  ? rm_transition_dynamic_power+0xd2/0x127 [nvidia]
[ 1099.882287]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[ 1099.882291]  ? nv_pmops_runtime_resume+0xc2/0x100 [nvidia]
[ 1099.882403]  ? pci_pm_runtime_resume+0xa0/0x100
[ 1099.882405]  ? __rpm_callback+0x4d/0x170
[ 1099.882408]  ? rpm_callback+0x6d/0x80
[ 1099.882409]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[ 1099.882411]  ? rpm_resume+0x594/0x7e0
[ 1099.882413]  ? __pm_runtime_resume+0x4e/0x80
[ 1099.882415]  ? pci_device_shutdown+0x23/0x90
[ 1099.882416]  ? nv_indicate_not_idle+0x2b/0x40 [nvidia]
[ 1099.882529]  ? _nv041586rm+0xf4/0x240 [nvidia]
[ 1099.882673]  ? _nv000779rm+0x43/0x70 [nvidia]
[ 1099.882816]  ? rm_acpi_nvpcf_notify+0x56/0xe0 [nvidia]
[ 1099.882959]  ? nv_acpi_nvpcf_event+0x40/0x50 [nvidia]
[ 1099.883073]  ? acpi_ev_notify_dispatch+0x56/0xa0
[ 1099.883077]  ? acpi_os_execute_deferred+0x17/0x40
[ 1099.883078]  ? process_one_work+0x16c/0x350
[ 1099.883082]  ? worker_thread+0x306/0x440
[ 1099.883084]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1099.883085]  ? _raw_spin_lock_irqsave+0xe/0x20
[ 1099.883088]  ? __pfx_worker_thread+0x10/0x10
[ 1099.883089]  ? kthread+0xef/0x120
[ 1099.883092]  ? __pfx_kthread+0x10/0x10
[ 1099.883093]  ? ret_from_fork+0x44/0x70
[ 1099.883096]  ? __pfx_kthread+0x10/0x10
[ 1099.883097]  ? ret_from_fork_asm+0x1b/0x30
[ 1099.883102]  </TASK>

I’m posting the stack trace explicitly here so that search engines can index it for others hitting the same issue.

UPD: I can confirm that the freezes do not occur when I fully disable the NVIDIA GPU (nvidia-smi can no longer find it).
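For anyone who wants to reproduce that state, something like the following should get there on Pop!_OS (a sketch, assuming system76-power; manually blacklisting the nvidia modules works as well):

sudo system76-power graphics integrated   # the nvidia driver is no longer loaded after a reboot
# after rebooting, nvidia-smi reports that no devices were found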

Hi @vindex10
Please share reliable repro steps and let us know how long it takes to reproduce the issue.

Hi @amritis. Thank you for your attention.

The thing is that these micro-freezes are not predictable. So the steps are:

  1. Make sure the NVIDIA driver is loaded (nvidia-smi displays information about the GPU).
  2. Turn on some music (headphones or speakers, it doesn’t matter). The audio itself shouldn’t be related; it is just the easiest way to spot the freezes.
  3. Just relax and wait until you hear the music stutter (it varies, but I start noticing it after a few minutes; you can’t miss it). See the sketch below for a more objective check.
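In case listening for stutter is too subjective, here is a rough detector that prints a timestamp whenever a kworker pegs a core (a sketch; the 90% threshold and the 0.2 s poll interval are arbitrary):

while true; do
    sleep 0.2
    # flag any kworker above ~90% CPU in one batch-mode top snapshot
    if top -bn1 -o '%CPU' | awk '$12 ~ /kworker/ && $9+0 > 90 {found=1} END {exit !found}'; then
        echo "$(date '+%F %T') kworker CPU spike"
    fi
done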