rcu: INFO: rcu_preempt self-detected stall on CPU, unable to access the system, system freeze

Hello, NVIDIA experts,

I currently have a number of custom boards running JetPack 7.1. We have routed 4 MGBE (Multi-Gigabit Ethernet) ports from the UPHY to optical modules, connected the optical modules to external cables, and we use the 25G config. We then used a YAML file to aggregate the 4 MGBE ports into a bond (bond0) via LACP (802.3ad). We have also tried other bond modes, such as balance-rr, and the issue is the same. When the power is plugged in, the system shows the following errors:

  • CPU:0, Error: top-cbb-fabric@0x8100800000, irq=16

  • rcu: INFO: rcu_preempt self-detected stall on CPU (system freezes and cannot boot properly).

We tried booting into the system without creating the bond, and it worked, but then the four 25G ports can only be used separately. We need them to work as one logical port.

Additionally, if we insert the optical module after the system has booted, the system doesn’t freeze, but it logs the error message:

  • CPU:0, Error: top-cbb-fabric@0x8100800000, irq=16

Please help us investigate this issue. It is critical for us and has a significant impact.

COM13 (USB Serial Port (COM13))_2026-01-26-100010.log (217.0 KB)

And our YAML file is like this:

yaml.txt (517 Bytes)

Could you share the steps to reproduce this issue on the NV devkit?

Hi WayneWWW,

I’m very glad to receive your reply. The optical module on the devkit is not the same model as the one soldered on our custom board: the devkit module is a finished product that plugs directly into the QSFP slot, whereas on our custom board it is surface-mounted on the PCB. The issue has not occurred on the devkit, but I will try to further confirm the details. Could you please advise what might be causing the system freeze on our custom board? If the issue doesn’t occur on the devkit, is it possible to debug and troubleshoot in the current custom-board environment?

Hi WayneWWW,

COM59 (USB Serial Device (COM59))_2026-01-27-114531.log (155.6 KB)

COM26 (USB Serial Device (COM26))_2026-01-27-114203.log (302.6 KB)

I just tried connecting two devkits via optical modules and fiber. After bonding and linking at 100G, when I unplug the power and plug it back in, the system doesn’t freeze, but it still reports the ‘top-cbb’ error. My custom board, however, freezes and additionally reports an ‘rcu’ error. What is the cause of this? Is it related to the power-up sequence of the optical module? What should I do next?

After bonding and linking at 100G

Could you share the steps to enable this bonding link?

Hello WayneWWW,

Of course. First, the kernel must support bonding: enable CONFIG_BONDING to get bonding.ko. Then use the yaml.txt I provided above as a sample and place it under /etc/netplan/. In that configuration file, the four links are aggregated into one using LACP.
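A quick way to confirm the running kernel actually has bonding available (this assumes CONFIG_IKCONFIG_PROC is enabled so /proc/config.gz exists):

# Check the running kernel's bonding support and that the module loads cleanly.
zcat /proc/config.gz | grep CONFIG_BONDING
modinfo bonding | head -n 3
sudo modprobe bonding
lsmod | grep bonding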

yaml.txt (517 Bytes)
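The attached yaml.txt is the authoritative version; roughly, its structure is like the sketch below (the file name, renderer, and address here are illustrative; only the mgbe*_0 interface names and the 802.3ad parameters matter):

sudo tee /etc/netplan/90-bond0.yaml >/dev/null <<'EOF'
# Sketch of an LACP bond over the four MGBE ports (address is an example).
network:
  version: 2
  renderer: networkd
  ethernets:
    mgbe0_0: {dhcp4: false}
    mgbe1_0: {dhcp4: false}
    mgbe2_0: {dhcp4: false}
    mgbe3_0: {dhcp4: false}
  bonds:
    bond0:
      interfaces: [mgbe0_0, mgbe1_0, mgbe2_0, mgbe3_0]
      addresses: [192.168.10.2/24]
      parameters:
        mode: 802.3ad
        lacp-rate: fast
        mii-monitor-interval: 100
        transmit-hash-policy: layer3+4
EOF
sudo netplan apply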

The issue appears only when, at system startup, the optical fiber is properly connected to the optical module and the bond configuration is in place. If either of these two conditions is missing, it does not appear. For example, if one end of the fiber is disconnected from a board (not linked), or if the fiber is not properly connected during startup, the issue does not appear even if the fiber is connected after the system has started. It seems that once the system is up and the CPU is scheduling, the freeze is avoided.

If the optical fiber is unplugged promptly after the RCU stall occurs, the system can still continue to boot.

Is there any progress?

Any news? Progress?

Does this issue occur with the RT kernel, or does the normal kernel also reproduce it?

Hi WayneWWW,

We are using the normal kernel, not the RT version.

Linux tegra-ubuntu 6.8.12-tegra #1 SMP PREEMPT Fri Jan  9 11:22:23 CST 2026 aarch64 aarch64 aarch64 GNU/Linux

Please still share the step-by-step method here, to make sure we are aligned with your steps.

And what is the expected result after running it?

Hi WayneWWW,

The optical module configuration on the devkit and on our finished product is the same. So far, I haven’t observed any RCU issue when the devkit boots, although there is still the top-cbb-fabric error. On our custom board the issue occurs every time. There are no special steps involved; it is just as I described earlier:

  1. I recompiled bonding.ko based on the kernel defconfig and replaced the kernel and all of the .ko files on the board.

  2. I edited the YAML file under /etc/netplan/ and used LACP to bond the 4 interfaces, as shown in the example I provided.

  3. The two boards have identical configurations (same subnet, different IPs) and are connected via fiber; the link comes up at 100G and I can ping between them (the quick checks I use are sketched after this list).

  4. When I power off and reboot both boards, the system experiences an RCU deadlock and login is not possible.
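For completeness, the quick checks mentioned in step 3 (assuming the bond is named bond0 and the members are the mgbe*_0 interfaces):

# Per-slave LACP state, aggregator IDs, and 802.3ad partner info.
cat /proc/net/bonding/bond0
# Bond mode, miimon and xmit hash policy as the kernel sees them.
ip -d link show bond0
# The aggregated speed should show up here (e.g. 100000Mb/s for 4 x 25G).
ethtool bond0 | grep -i speed
# <peer-ip> is a placeholder for the other board's address.
ping -c 3 <peer-ip>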

Hi wpceswpces,

The following is largely supposition on my part. From your log it appears the UPHY / SerDes / Ethernet PCS is being accessed before the relevant power domain is on, or before the reset/enable GPIOs/regulators are asserted. GPIO682 is toggled to 1 at ~17.18s, but the top-cbb PWRDOWN_ERR occurs at 16.54s, before that GPIO is enabled. That may mean GPIO682 enables something required for the UPHY/Ethernet path (module power, retimer enable, PHY reset deassert), and Linux tries to access UPHY/Ethernet registers before that “something” is enabled, leading to the PWRDOWN_ERR and later the RCU stall.

What is GPIO682? If GPIO682 enables the module/retimer/PHY, enable it in the device tree before any nvethernet/UPHY register access.
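If it helps narrow this down, here is a minimal sketch of how GPIO682 and its timing could be inspected from userspace (assumes debugfs is mounted and the libgpiod tools are installed):

# Kernel view of the line: label, direction, level and which driver/consumer holds it.
# This listing uses global GPIO numbers such as "gpio-682".
sudo grep 'gpio-682' /sys/kernel/debug/gpio
# libgpiod lists lines per chip with per-chip offsets, so 682 may not appear literally;
# grep for likely consumers instead.
gpioinfo | grep -i -e reset -e enable -e qsfp
# Correlate with the fabric error: PWRDOWN_ERR prints at ~16.54s, so check whether the
# relevant supply/reset only comes up after that point.
dmesg | grep -i -e 'top-cbb' -e regulator -e gpio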

Hi whitesscott,

I’m very glad to receive your reply. GPIO682 is a regular GPIO; I just export it as a flag signal in rc.local, and it is not related to the module/retimer/PHY. After reviewing the documentation, I also suspect that the issue might be related to a kernel power domain during system startup, and that there could be a timing issue with the optical module. Both our CPU and the optical module are powered and reset by external components, following the power-up sequence provided by our hardware engineers. I’m not sure whether this has affected or triggered other bugs, so I would like NVIDIA engineers to take a look. This issue is very important to us.

I think the key problem here is that, according to your comment, this issue is not reproducible on the NV devkit.

Is it possible to share the schematic and the QSFP module details for review first?

Hi wpceswpces,

A different tactic: bring the network up not at boot but slightly later, before network.target, leave the optics connected, and test whether your board is OK.

sudo nmcli con modify bond0 connection.autoconnect no

sudo tee /usr/local/sbin/mgbe-stagger-up.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

IFACES=(mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0)

# If bond exists, bring it down first and ignore errors.
nmcli con down bond0 2>/dev/null || true
ip link set bond0 down 2>/dev/null || true

# Bring up each mgbe with a delay
for i in "${IFACES[@]}"; do
  ip link set "$i" up
  sleep 2
done

ip link set bond0 up 2>/dev/null || true
nmcli con up bond0 2>/dev/null || true

exit 0
EOF

sudo chmod +x /usr/local/sbin/mgbe-stagger-up.sh

Run the script from a systemd service:

sudo tee /etc/systemd/system/mgbe-stagger-up.service >/dev/null <<'EOF'
[Unit]
Description=Stagger bring-up of MGBE interfaces
Wants=network-pre.target
After=network-pre.target
Before=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/mgbe-stagger-up.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now mgbe-stagger-up.service
sudo systemctl status mgbe-stagger-up.service --no-pager
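To confirm the staggered bring-up actually ran, and in which order the interfaces came up, check the unit's journal and the resulting link state after boot:

# Verify the oneshot service executed and see its timing in the boot log.
journalctl -b -u mgbe-stagger-up.service --no-pager
# Confirm the bond and its mgbe members ended up administratively up.
ip -br link show | grep -e bond0 -e mgbe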

Hi wpceswpces

Are you still using NFS for the rootfs? Are you using nvethernet MGBE at all? And is your WangXun NIC 25GbE (txgbe) or the slower ngbe? Your log shows ngbe.

Have you tried using:

  1. Building the networking support in (a non-interactive way to set these symbols is sketched after this list):
.config
CONFIG_NET_VENDOR_WANGXUN=y
CONFIG_TXGBE=y
CONFIG_TLS=y  # required to be 'y' so bonding can be 'y'
CONFIG_BONDING=y

Networking support / Networking options / Transport Layer Security support
Set it to [*] to make it built-in (CONFIG_TLS=y).

Device Drivers / Network device support / Bonding driver support (CONFIG_BONDING)
Set it to [*] to make it built-in (CONFIG_BONDING=y).

Symbol: NET_VENDOR_WANGXUN [=y]
Device Drivers / Network device support (NETDEVICES [=y]) / Ethernet driver support (ETHERNET [=y]) / Wangxun devices (NET_VENDOR_WANGXUN [=y])

  2. Put the bond setup on the kernel command line.

Append the following after APPEND in /boot/extlinux/extlinux.conf:

ip=<client-ip>::<gw-ip>:<mask>:<hostname>:bond0:off
bond=bond0:mgbe0_0,mgbe1_0,mgbe2_0,mgbe3_0:mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4,lacp_rate=1
root=/dev/nfs
nfsroot=<SERVER_IP>:/export/thor-root,vers=4.2,tcp
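For option 1 above, a non-interactive sketch of flipping those symbols before rebuilding (the source path is an assumption based on the JetPack 7 layout; adjust it to your tree):

# Enable the listed symbols as built-in using the kernel's own scripts/config helper,
# then let Kconfig resolve any newly exposed dependencies.
cd Linux_for_Tegra/source/kernel/kernel-noble    # path assumption
./scripts/config --file .config \
    --enable NET_VENDOR_WANGXUN \
    --enable TXGBE \
    --enable TLS \
    --enable BONDING
make olddefconfig
grep -E 'CONFIG_(NET_VENDOR_WANGXUN|TXGBE|TLS|BONDING)=' .config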

Hi whitesscott,

Our WangXun NIC is 1000 Mb, and we are still using nvethernet MGBE. Currently I have both NFS and local rootfs set up. Yesterday I tried resetting the optical module after the system reaches rc.local, and that seemed to work with some probability, allowing the system to boot. This morning, after powering on both boards, I found that the login prompt appeared but there was still an rcu error. It seems this only mitigates the issue to some extent, possibly because of timing, but it doesn’t fundamentally resolve the problem. Fundamentally, I feel the issue is that the driver is trying to fetch some status and this blocks the CPU. It feels like a bug.
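For reference, the rc.local workaround I tried looks roughly like the sketch below; NNN is a placeholder for our board's module-reset GPIO number, not a real value:

# Sketch of the rc.local workaround: pulse the optical-module reset after the system
# is up, then force the bond to renegotiate. NNN is a placeholder GPIO number.
echo NNN > /sys/class/gpio/export 2>/dev/null || true
echo out > /sys/class/gpio/gpioNNN/direction
echo 0 > /sys/class/gpio/gpioNNN/value    # assert module reset
sleep 0.1
echo 1 > /sys/class/gpio/gpioNNN/value    # release reset, let the link retrain
ip link set bond0 down && ip link set bond0 up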

What’s the situation on your side? Have you encountered a similar issue, like “CPU:0, Error: top-cbb-fabric@0x8100800000”?

Hi WayneWWW,

I found that even unplugging the power of one device causes the interconnected other device to consistently report rcu issues while it is running. However, because the system is already up and scheduling, the rcu messages appear but do not cause a full deadlock. I exported the dmesg. Can you help analyze the stack trace? Is it possible to add debugging in certain parts of the kernel source to further locate the issue? The pc: handle_softirqs+0xa8/0x36c pointer suggests the problem may lie in soft-interrupt handling. Please help analyze it as well, thank you very much! It looks like there is an issue while the driver is trying to bring the link up, especially the Workqueue: events set_speed_work_func [nvethernet] part.

dmesg.txt (130.4 KB)

(base) root@tegra-ubuntu:~# nvpmodel -m 0
▒▒INFO: END TASK:MB▒▒
INFO: enter idle task.
INFO: END TASK:MB▒▒
INFO: enter idle task.
▒▒(base) root@tegra-ubuntu:~# jetson_clocks
Enabled Legacy persistence mode for GPU 00000000:01:00.0.
All done.
(base) root@tegra-ubuntu:~# [  427.775249] rcu: INFO: rcu_preempt self-detected stall on CPU
[  427.775251] rcu:     4-....: (1 GPs behind) idle=3eec/1/0x4000000000000000 softirq=2878/2880 fqs=2142
[  427.775256] rcu:     (t=5250 jiffies g=7221 q=611 ncpus=14)

(base) root@tegra-ubuntu:~#

[  406.783604] nvethernet a808a10000.ethernet mgbe0_0: Link is Up - 25Gbps/Full - flow control off
[  427.775249] rcu: INFO: rcu_preempt self-detected stall on CPU
[  427.775251] rcu: 	4-....: (1 GPs behind) idle=3eec/1/0x4000000000000000 softirq=2878/2880 fqs=2142
[  427.775256] rcu: 	(t=5250 jiffies g=7221 q=611 ncpus=14)
[  427.775258] CPU: 4 PID: 132 Comm: kworker/4:1 Tainted: G        W  OE      6.8.12-tegra #1
[  427.775260] Hardware name: NVIDIA NVIDIA Jetson AGX Thor Developer Kit/Jetson, BIOS 202512.0-39e87081 12/31/2025
[  427.775261] Workqueue: events set_speed_work_func [nvethernet]
[  427.775275] pstate: 43400009 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  427.775277] pc : handle_softirqs+0xa8/0x36c
[  427.775287] lr : handle_softirqs+0x78/0x36c
[  427.775288] sp : ffff800080023f30
[  427.775289] x29: ffff800080023f30 x28: ffff000086d25dc0 x27: 0000000000000000
[  427.775291] x26: 0000000000000084 x25: ffffb2cecd82bcf0 x24: 0000000000000000
[  427.775293] x23: 0000000063400009 x22: 0000000000000207 x21: ffff800083cdb540
[  427.775295] x20: ffffb2cecacd021c x19: 0000000000000000 x18: fffffffffffe8d9f
[  427.775297] x17: ffff4d5086ea0000 x16: ffff800080020000 x15: ffffffffffffffff
[  427.775298] x14: 000000000000077c x13: 00000000ffffffea x12: ffffb2cecd883e80
[  427.775300] x11: 0000000000000040 x10: ffff00008000a468 x9 : 0000000000000000
[  427.775302] x8 : ffff001f53d84588 x7 : bc8140b7f3a4278e x6 : 0000000873e106b0
[  427.775304] x5 : 1fffffffffffffff x4 : 0000000000000015 x3 : 0000005eb6317f6d
[  427.775306] x2 : ffff000086d25dc0 x1 : ffffb2ceccee5e80 x0 : ffff4d5086ea0000
[  427.775308] Call trace:
[  427.775310]  handle_softirqs+0xa8/0x36c
[  427.775312]  __do_softirq+0x14/0x20
[  427.775314]  ____do_softirq+0x10/0x1c
[  427.775318]  call_on_irq_stack+0x24/0x4c
[  427.775320]  do_softirq_own_stack+0x1c/0x28
[  427.775322]  irq_exit_rcu+0xbc/0xcc
[  427.775324]  el1_interrupt+0x38/0x68
[  427.775328]  el1h_64_irq_handler+0x18/0x24
[  427.775330]  el1h_64_irq+0x68/0x6c
[  427.775331]  vprintk_store+0x234/0x448
[  427.775334]  vprintk_emit+0xb0/0x2b4
[  427.775336]  dev_printk_emit+0xac/0xe0
[  427.775339]  __netdev_printk+0xc0/0x20c
[  427.775344]  netdev_info+0x64/0x90
[  427.775345]  phy_print_status+0x78/0x124
[  427.775349]  set_speed_work_func+0x114/0x2b8 [nvethernet]
[  427.775354]  process_one_work+0x170/0x3fc
[  427.775358]  worker_thread+0x320/0x438
[  427.775360]  kthread+0x110/0x114
[  427.775363]  ret_from_fork+0x10/0x20
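In the meantime, here is a minimal sketch of the extra tracing I can capture on my side around the path shown in the backtrace (assumes tracefs is available under /sys/kernel/tracing and commands are run as root):

# Trace the nvethernet worker that appears in the Workqueue line of the stall report.
cd /sys/kernel/tracing
echo 0 > tracing_on
grep set_speed_work_func available_filter_functions    # confirm the symbol is traceable
echo set_speed_work_func > set_ftrace_filter
echo function > current_tracer
echo 1 > options/func_stack_trace    # record who queued/ran the work
echo 1 > tracing_on
# ...re-plug the fiber or power-cycle the peer to reproduce, then:
head -n 100 trace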

Hi wpceswpces

Are you using a CoE camera? Its MGBE might be in contention with mgbe*_0.

In case it is helpful, here is where that error comes from.

[2026-01-26 10:00:57]  [   16.547627] CPU:0, Error: top-cbb-fabric@0x8100800000, irq=16
[2026-01-26 10:00:57]  [   16.547642] **************************************
[2026-01-26 10:00:57]  [   16.547654] CPU:0, Error:top-cbb-fabric, Errmon:1
[2026-01-26 10:00:57]  [   16.549042]   Error Code: PWRDOWN_ERR
[2026-01-26 10:00:57]  [   16.552881]   Overflow: Multiple PWRDOWN_ERR
[2026-01-26 10:00:57]  [   16.552886] 
[   16.552887]   Error Code: PWRDOWN_ERR
[   16.552889]   MASTER_ID: CCPLEX
[2026-01-26 10:00:57]  [   16.566160]   Address: 0xa808ade008
[2026-01-26 10:00:57]  [   16.569996]   Cache: 0x1 -- Bufferable 
[2026-01-26 10:00:57]  [   16.574187]   Protection: 0x2 -- Unprivileged, Non-Secure, Data Access
[2026-01-26 10:00:57]  [   16.580823]   Access_Type: Read
[2026-01-26 10:00:57]  [   16.584314]   Access_ID: 0x3
[2026-01-26 10:00:57]  [   16.584316]   Fabric_Id: 0x4
[2026-01-26 10:00:57]  [   16.590601]   Fabric: uphy0-cbb-fabric
[   16.594444]   or Fabric: aon-fabric

jp7/Linux_for_Tegra/source/kernel/kernel-noble/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra234-cbb.yaml

  The Control Backbone (CBB) is comprised of the physical path from an
  initiator to a target's register configuration space. CBB 2.0 consists
  of multiple sub-blocks connected to each other to create a topology.
  The Tegra234 SoC has different fabrics based on CBB 2.0 architecture
  which include cluster fabrics BPMP, AON, PSC, SCE, RCE, DCE, FSI and
  "CBB central fabric".

  In CBB 2.0, each initiator which can issue transactions connects to a
  Root Master Node (MN) before it connects to any other element of the
  fabric. Each Root MN contains a Error Monitor (EM) which detects and
  logs error. Interrupts from various EM blocks are collated by Error
  Notifier (EN) which is per fabric and presents a single interrupt from
  fabric to the SoC interrupt controller.

  The driver handles errors from CBB due to illegal register accesses
  and prints debug information about failed transaction on receiving
  the interrupt from EN. Debug information includes Error Code, Error
  Description, MasterID, Fabric, SlaveID, Address, Cache, Protection,
  Security Group etc on receiving error notification.

  If the Error Response Disable (ERD) is set/enabled for an initiator,
  then SError or Data abort exception error response is masked and an
  interrupt is used for reporting errors due to illegal accesses from
  that initiator. The value returned on read failures is '0xFFFFFFFF'
  for compatibility with PCIE.

jp7/Linux_for_Tegra/source/kernel/kernel-noble/drivers/soc/tegra/cbb/tegra234-cbb.c

                /*
                 * In T264, AON Fabric ID value is incorrectly same as UPHY0 fabric ID.
                 * For 'ID = 0x4', we must check for the address which caused the error
                 * to find the correct fabric which returned error.
                 */
                tegra_cbb_print_err(file, "\t  or Fabric\t\t: %s\n",
                                    cbb->fabric->fab_list[T264_AON_FABRIC_ID].name);
                tegra_cbb_print_err(file, "\t  Please use Address to determine correct fabric.\n");
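Since the driver asks you to use the address to determine the correct fabric, one quick check is to map the faulting address from your log (0xa808ade008) against the running system's register regions; the dmesg you posted shows mgbe0_0 at a808a10000.ethernet, which is in the same neighborhood:

# See which device region the faulting CBB address falls into.
sudo grep -i a808 /proc/iomem
# Cross-check the ethernet/UPHY node addresses exposed by the device tree
# (the bus@0 path is an assumption; adjust if your tree differs).
ls /proc/device-tree/bus@0/ 2>/dev/null | grep -i -e ethernet -e uphy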

static const struct tegra234_fabric_lookup tegra241_cbb_fab_list[];
/*
 * Possible causes for Slave and Timeout errors.
 * SLAVE_ERR:
 * Slave being accessed responded with an error. Slave could return
 * an error for various cases :
 *   Unsupported access, clamp setting when power gated, register
 *   level firewall(SCR), address hole within the slave, etc
 *
 * TIMEOUT_ERR:
 * No response returned by slave. Can be due to slave being clock
 * gated, under reset, powered down or slave inability to respond
 * for an internal slave issue
 */
static const struct tegra_cbb_error tegra241_cbb_errors[] = {
        /* ... */
        {
                .code = "PWRDOWN_ERR",
                .desc = "Attempt to access a portion of fabric that is powered down",
        },
        /* ... */
};

Here’s the dts:

source/hardware/nvidia/t264/nv-public/tegra264.dtsi

        bus@0 {
                compatible = "simple-bus";
...
                top0-cbb-fabric@8100800000 {
                        compatible = "nvidia,tegra264-top0-cbb-fabric";
                        reg = <0x81 0x800000 0x0 0x800000>;
                        interrupts = <GIC_SPI 67 IRQ_TYPE_LEVEL_HIGH>;
                        status = "disabled";
                };

                vision-cbb-fabric@8180800000 {
                        compatible = "nvidia,tegra264-vision-cbb-fabric";
                        reg = <0x81 0x80800000 0x0 0x800000>;
                        interrupts = <GIC_SPI 324 IRQ_TYPE_LEVEL_HIGH>;
                        status = "disabled";
                };

source/hardware/nvidia/t264/nv-public/nv-platform/tegra264-p3834-common.dtsi
/ {
        bus@0 {

                top0-cbb-fabric@8100800000 {
                        status = "okay";
                };

Is your top0-cbb-fabric@8100800000 “okay”?

node=/sys/firmware/devicetree/base/bus@0/top0-cbb-fabric@8100800000
tr '\0' '\n' < $node/compatible
tr '\0' '\n' < $node/name
tr '\0' '\n' < $node/status

hexdump -Cv $node/reg
hexdump -Cv $node/interrupts