R38.4.0 Encountered a system crash issue

hi:

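A customer ran into a problem during use: the system hung while the username was being entered at the login prompt, and after some time it rebooted. The log is attached.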

The RT patch has already been applied.

[   22.190715] tegra-nvcsi 8181200000.host1x:nvcsi@8188000000: Failed to create device link (0x180) with 9-0066
[   22.190883] g300 9-0067: supply vcc not found, using dummy regulator
[   22.190964] g300 9-0067: pps-gpios not found 0
[   22.262719] debugfs: Directory 'g300_a' with parent '/' already present!
[   22.263593] tegra-nvcsi 8181200000.host1x:nvcsi@8188000000: Failed to create device link (0x180) with 9-0067
[   22.404056] kdump-tools[737]: Starting kdump-tools:
[   22.404312] kdump-tools[788]:  * Invalid kernel version : 6.8.12-rt-tegra
[   22.404526] kdump-tools[788]:  * Invalid symlink : /var/lib/kdump/initrd.img
[   22.404653] kdump-tools[788]:  * Creating symlink /var/lib/kdump/initrd.img
[   22.404702] kdump-tools[788]:  * Invalid symlink : /var/lib/kdump/vmlinuz
[   22.404751] kdump-tools[788]:  * Creating symlink /var/lib/kdump/vmlinuz
[   22.405083] kdump-tools[923]: Can't open (/proc/kcore).
[   22.405474] kdump-tools[923]: Warning, can't get the VA_BITS from kcore
[   22.405527] kdump-tools[923]: Can't open (/proc/kcore).
[   22.405874] kdump-tools[788]:  * loaded kdump kernel

Ubuntu 24.04.3 LTS localhost.localdomain ttyUTC0

localhost login: [   25.059277] debugfs: Directory 'null' with parent '/' already present!
[   25.059431] cnss: Failed to get cooling device node
[   25.104736] [ip][0x17f09d8][00:03:12.609255] wlan: [3559:E:MLO_MGR] mlo_mgr_ml_peer_exist_on_diff_ml_ctx: MLD ID 0 exists with mac a8:dd:9f:cc:b3:71
[   25.156268] warning: `iwconfig' uses wireless extensions which will stop working for Wi-Fi 7 hardware; use nl80211fs file
[   25.458962] ar0234 9-0030: ar0234_open:
[   25.458974] ar0234 9-0032: ar0234_open:
[   25.458983] ar0234 9-0034: ar0234_open:
[   25.458991] ar0234 9-0036: ar0234_open:
[   25.459000] ar0234 9-0040: ar0234_open:
[   25.459009] ar0234 9-0041: ar0234_open:
[   25.648859] [soft_i][0x1875763][00:03:13.153394] wlan: [0:E:REGULATORY] reg_freq_to_chan_for_chlist: invalid frequency 5945
[   26.464935] ar0234 9-0030: ar0234_open:
[   26.464948] ar0234 9-0032: ar0234_open:
[   26.464957] ar0234 9-0034: ar0234_open:
[   26.464968] ar0234 9-0036: ar0234_open:
[   26.464978] ar0234 9-0040: ar0234_open:
[   26.464986] ar0234 9-0041: ar0234_open:
[   27.048965] [schedu][0x19cb492][00:03:14.553505] wlan: [2243:E:PE] lim_intersect_ap_emlsr_caps: mlo peer ctx is null
[   27.091833] [soft_i][0x19d5c06][00:03:14.596373] wlan: [0:E:DP] dp_tx_update_peer_stats: Release source:3 is not from TQM
[   27.127071] BUG: scheduling while atomic: NetworkManager/1356/0x00000005
[   27.128772] WARNING: CPU: 12 PID: 1356 at kernel/softirq.c:148 __local_bh_disable_ip+0xf4/0x130
[   27.129334] WARNING: CPU: 12 PID: 1356 at kernel/rcu/tree_plugin.h:320 rcu_note_context_switch+0x524/0x568
[   27.129456] ---[ end trace 0000000000000000 ]---
[   27.161817] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.162134] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.165445] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.166256] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.167715] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.167982] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.170207] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.170392] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.170776] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   27.171654] NOHZ tick-stop error: local softirq work is pending, handler #80!!!
[   33.300376] platform sound: deferred probe pending: tegra-audio-graph-card: parse error

localhost login:
localhost login:
localhost login: agi
1


[   48.134038] rcu: INFO: rcu_preempt self-detected stall on CPU
[   48.134044] rcu: 	12-...!: (1 GPs behind) idle=bcd4/1/0x4000000000000002 softirq=0/0 fqs=2
[   48.134049] rcu: rcu_preempt kthread timer wakeup didn't happen for 5245 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   48.134051] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[   48.134053] rcu: rcu_preempt kthread starved for 5246 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=12
[   48.134054] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   48.134055] rcu: RCU grace-period kthread stack dump:















[  111.154041] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  111.154047] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  111.154056] rcu: 	(detected by 0, t=21005 jiffies, g=2757, q=11330 ncpus=14)
[  111.154415] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  111.154417] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  111.154417] rcu: rcu_preempt kthread starved for 15755 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->stat
[  111.154419] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavi
[  111.154420] rcu: RCU grace-period kthread stack dump:
[  111.154568] NMI backtrace for cpu 12U GP kthread last ran:








[  174.173913] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  174.173919] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  174.173928] rcu: 	(detected by 5, t=36760 jiffies, g=2757, q=14244 ncpus=14)
[  174.174275] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  174.174277] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  174.174281] rcu: RCU grace-period kthread stack dump:sufficient CPU time, OOM is now expected behavior.x402 ->cpu=12
[  237.194042] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  237.194046] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  237.194056] rcu: 	(detected by 0, t=52515 jiffies, g=2757, q=16973 ncpus=14)
[  237.194406] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  237.194409] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  237.194412] rcu: RCU grace-period kthread stack dump:sufficient CPU time, OOM is now expected behavior.x402 ->cpu=12
[  237.194557] NMI backtrace for cpu 12U GP kthread last ran:


[  300.214044] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  300.214049] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  300.214059] rcu: 	(detected by 0, t=68270 jiffies, g=2757, q=19321 ncpus=14)
[  300.214410] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  300.214412] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  300.214416] rcu: RCU grace-period kthread stack dump:sufficient CPU time, OOM is now expected behavior.x402 ->cpu=12
[  363.234043] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  363.234048] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  363.234424] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  363.234426] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  363.234430] rcu: RCU grace-period kthread stack dump:sufficient CPU time, OOM is now expected behavior.x402 ->cpu=12
[  426.254042] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  426.254047] rcu: 	Tasks blocked on level-0 rcu_node (CPUs 0-13): P1356/2:b..l P3444/1:b..l P3897/1:b..l P3467/1:b..l P3907/1:b..l P668/1:b..l P1195/1:b..l
[  426.254382] rcu: rcu_preempt kthread timer wakeup didn't happen for 15754 jiffies! g2757 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  426.254384] rcu: 	Possible timer handling issue on cpu=12 timer-softirq=3306
[  426.254387] rcu: RCU grace-period kthread stack dump:sufficient CPU time, OOM is now expected behavior.x402 ->cpu=12
[  426.254410] rcu: Stack dump where RCU GP kthread last ran:
[0000.092] I> MB1 (version: 0.23.0.2-t264-75019003-378e427f)
[0000.092] C> Boot-mode : Coldboot
[0000.092] C> MB1 last_boot_error: 0x0
[0000.092] I> Entry timestamp: 0x00012be6
[0000.094] C> rst_source: 0x33, rst_level: 0x1
[0000.098] I> BR-BCT: preprod_dev_sign: 0
[0000.102] I> Socket mask: 0x1
[0000.105] I> Socket id: 0
[0000.107] I> Chip supports UFS HS mode
[0000.111] I> BR last_boot_error0: 0x0
[0000.114] I> BR last_boot_error1: 0x0
[0000.118] I> BR last_boot_error2: 0x0
[0000.121] I> NVBCT is initialized

115200.log (366.0 KB)

Hi mingming,

Do you mean there’s still the rcu stall issue after the patch has been applied?

Could you share the result of the following command on your board after you have applied the patch and enabled rt-kernel?

$ modinfo nvidia | egrep 'vermagic'

hi:

Yes, the patch has already been applied.

Output of modinfo nvidia | egrep "vermagic":

img_v3_02uk_16bf68fa-9ef2-42f4-aa20-388ec458e3cg

We do not see this RCU stall on the devkit after applying the patch.
Please clarify whether the issue is specific to your custom board or may be caused by one of your custom modules.

hi:

It is our own custom board. The issue appeared while the customer was testing power-on, when entering the username and password.

The rcu stall issue may not relate to entering the username/password to log-in.
From the log you shared, it seems you have many custom kernel modules.
Please help to check if any of them may cause the rcu stall.
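
One way to narrow this down is to look at what the kernel reported just before the first stall. Below is a minimal sketch, assuming the serial console capture has been saved as a plain-text file; the function name and the dictionary keys are illustrative and not part of any NVIDIA tooling. It picks out the first "scheduling while atomic" report, the time of the first RCU stall, and the PIDs that RCU lists as blocked.

```python
import re
import sys

# "BUG: scheduling while atomic: <comm>/<pid>/<preempt_count>"
ATOMIC_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] BUG: scheduling while atomic: (.+)/(\d+)/(0x[0-9a-fA-F]+)"
)
# Matches both "self-detected stall" and "detected stalls on CPUs/tasks"
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] rcu: INFO: \w+ (?:self-)?detected (?:expedited )?stall"
)
# Entries such as "P1356/2:b..l" in the "Tasks blocked on ..." line
BLOCKED_RE = re.compile(r"\bP(\d+)/")


def summarize(log_text):
    """Collect the first atomic-context bug, the first stall time, and blocked PIDs."""
    atomic = None
    first_stall = None
    blocked = set()
    for line in log_text.splitlines():
        m = ATOMIC_RE.search(line)
        if m and atomic is None:
            atomic = {
                "time": float(m.group(1)),
                "comm": m.group(2),
                "pid": int(m.group(3)),
                "preempt_count": m.group(4),
            }
        m = STALL_RE.search(line)
        if m and first_stall is None:
            first_stall = float(m.group(1))
        if "Tasks blocked on" in line:
            blocked.update(int(pid) for pid in BLOCKED_RE.findall(line))
    return {
        "atomic": atomic,
        "first_stall": first_stall,
        "blocked_pids": sorted(blocked),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # errors="replace" because serial captures often contain stray bytes
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(summarize(f.read()))
```

If the task named in the "scheduling while atomic" line also shows up among the blocked PIDs, the call stack printed with that BUG message is a reasonable place to start looking for the module involved. This is only a way to order the evidence, not a diagnosis.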

I am aware that you are on rt kernel. This post presents nearly identical dmesg errors.

rcu-info-rcu-preempt-self-detected-stall-on-cpu-unable-to-access-the-system-system-freeze

Here's an excerpt from one of that poster's logs:

[2026-01-26 10:04:15]  [  236.349119] rcu: INFO: rcu_preempt self-detected stall on CPU
[2026-01-26 10:04:37]  [  236.349120] rcu: 4-....: (52508 ticks this GP) idle=b254/1/0x4000000000000000 softirq=1494/1494 fqs=18372
[2026-01-26 10:04:37]  [  236.349122] rcu: (t=52509 jiffies g=125 q=94245 ncpus=14)
[2026-01-26 10:04:37]  [  238.573103] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-.... } 53064 jiffies s: 841 root: 0x10/.
[2026-01-26 10:04:39]  [  238.573112] rcu: blocking rcu_node structures (internal RCU debug):
[2026-01-26 10:04:39]  [  242.669118] INFO: task kworker/11:0:72 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.669142]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.669155] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.675538] INFO: task kworker/u33:0:89 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.682417]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.688353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2026-01-26 10:04:43]  [  242.696136] INFO: task kworker/u31:2:151 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.703020]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.708957] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.716839] INFO: task kworker/2:2:153 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.723629]       Tainted: G        W  OE      6.8.12-tegra #1
[  242.729215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2026-01-26 10:04:43]  [  242.737348] INFO: task kworker/u32:1:300 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.744232]       Tainted: G        W  OE      6.8.12-tegra #1
[  242.749817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2026-01-26 10:04:43]  [  242.757914] INFO: task NetworkManager:974 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.764834]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.770770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2026-01-26 10:04:43]  [  242.778570] INFO: task systemd-timedat:1565 blocked for more than 120 seconds.
[2026-01-26 10:04:43]  [  242.785787]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.791723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2026-01-26 10:04:43]  [  242.799458] INFO: task (udev-worker):1634 blocked for more than 120 seconds.
[  242.806392]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.812326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.820064] INFO: task (udev-worker):1637 blocked for more than 120 seconds.
[  242.826996]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.832931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.840663] INFO: task (udev-worker):1639 blocked for more than 121 seconds.
[  242.847600]       Tainted: G        W  OE      6.8.12-tegra #1
[2026-01-26 10:04:43]  [  242.853538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
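
To compare logs like this without reading every line, the hung-task reports can be condensed into a short list. The following is a rough sketch that relies only on the "INFO: task NAME:PID blocked for more than N seconds" wording visible in the excerpt; the helper names are made up for illustration.

```python
import re
from collections import Counter

# "INFO: task <comm>:<pid> blocked for more than <N> seconds."
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(log_text):
    """Return (comm, pid, seconds) for every hung-task report, in log order."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_RE.finditer(log_text)
    ]


def count_by_name(tasks):
    """Group reports by task name, folding kworker/<cpu>:<id> into 'kworker'."""
    return Counter(comm.split("/")[0] for comm, _pid, _secs in tasks)
```

Running hung_tasks() over two captures and comparing the names gives a quick view of which tasks are stuck in both, for example whether NetworkManager appears each time.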

hi:

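Please help us find out which one is causing this. We do not understand how RCU works, let alone how it could be affected. From what we observe, the system simply hangs while the username is being entered. Earlier releases had this problem as well, so it feels like it was never fully resolved.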

source/kernel/kernel-noble/Documentation/RCU/whatisRCU.rst
source/kernel/kernel-noble/tools/testing/selftests/bpf/benchs/run_bench_local_storage_rcu_tasks_trace.sh
source/kernel/kernel-noble/tools/testing/selftests/bpf/benchs/bench_local_storage_rcu_tasks_trace.c
/* local-storage-tasks-trace: Benchmark performance of BPF local_storage's use
 * of RCU Tasks-Trace.
 *
 * Stress RCU Tasks Trace by forking many tasks, all of which do no work aside
 * from sleep() loop, and creating/destroying BPF task-local storage on wakeup.
 * The number of forked tasks is configurable.
 *
 * exercising code paths which call call_rcu_tasks_trace while there are many
 * thousands of tasks on the system should result in RCU Tasks-Trace having to
 * do a noticeable amount of work.
 *
 * This should be observable by measuring rcu_tasks_trace_kthread CPU usage
 * after the grace period has ended, or by measuring grace period latency.
 *
 * This benchmark uses both approaches, attaching to rcu_tasks_trace_pregp_step
 * and rcu_tasks_trace_postgp functions to measure grace period latency and
 * using /proc/PID/stat to measure rcu_tasks_trace_kthread kernel ticks
 */
const struct bench bench_local_storage_tasks_trace = {

Here are some things you could try to see whether they help. Item 2 in the numbered list further down is mainly useful for diagnostics.

Disable NVMe APST (power saving) to test whether keeping the NVMe drive from sleeping makes a difference. Add this to the APPEND line in extlinux.conf: nvme_core.default_ps_max_latency_us=0

"irqbalance is Daemon to balance interrupts across multiple CPUs, which can lead to better performance and IO balance on SMP systems. This package is especially useful on systems with multi-core processors, as interrupts will typically only be serviced by the first core."

sudo apt install irqbalance
sudo systemctl enable --now irqbalance
sudo systemctl status irqbalance --no-pager
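
To see whether interrupts really are concentrated on one core before and after enabling irqbalance, the per-CPU columns of /proc/interrupts can be summed. This is a small sketch, assuming the usual layout of that file (a header row of CPU names followed by one row per interrupt); the function name is illustrative.

```python
def per_cpu_irq_totals(text):
    """Sum the per-CPU counters in /proc/interrupts-style text."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        _label, _sep, rest = line.partition(":")
        counts = rest.split()[: len(cpus)]
        # Skip summary rows (such as "Err:") that lack one counter per CPU
        if len(counts) != len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return dict(zip(cpus, totals))


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as f:
            for cpu, total in per_cpu_irq_totals(f.read()).items():
                print(f"{cpu}: {total}")
    except OSError:
        print("/proc/interrupts is not available on this system")
```

Comparing two runs a few seconds apart shows which CPUs are actually taking interrupts during that interval.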

Two more things you could try adding to extlinux.conf APPEND line

  1. rcupdate.rcu_cpu_stall_timeout=60
    Sets how long (in seconds) RCU waits before reporting a CPU stall. A longer timeout reduces how often stalls are reported, which may keep the system from spiraling under the logging load.

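  2. sysctl.kernel.hung_task_timeout_secs=60
    As a diagnostic tool to possibly catch which tasks are hanging. hung_task_timeout_secs is a sysctl rather than a boot parameter of its own, so on the kernel command line it needs the sysctl. prefix. It can also be changed at runtime through /proc/sys/kernel/hung_task_timeout_secs.
    If set to 0, it disables hung-task warnings entirely.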
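
After editing extlinux.conf and rebooting, it is worth confirming that the parameters actually reached the kernel. Here is a minimal sketch that parses /proc/cmdline and reports anything missing; the EXPECTED table and the function names are illustrative, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Parameters suggested in this thread (adjust to whatever was actually added)
EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "rcupdate.rcu_cpu_stall_timeout": "60",
}


def missing_params(cmdline, expected):
    """Return the expected parameters that are absent or have a different value."""
    params = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if params.get(k) != v}


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print("missing or different:", missing_params(f.read(), EXPECTED))
        with open("/proc/sys/kernel/hung_task_timeout_secs") as f:
            print("hung_task_timeout_secs:", f.read().strip())
    except OSError as err:
        print("could not read:", err)
```

An empty "missing or different" result means both command-line parameters are active. The hung-task timeout is read from /proc/sys instead, since that is where the kernel exposes its current value.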