rcu: INFO: rcu_preempt self-detected stall on CPU, unable to access the system, system freeze

Hi,whitesscott

(base) root@tegra-ubuntu:~# node=/sys/firmware/devicetree/base/bus@0/top0-cbb-fabric@8100800000
(base) root@tegra-ubuntu:~# tr '\0' '\n' < $node/compatible
tr '\0' '\n' < $node/name
tr '\0' '\n' < $node/status

hexdump -Cv $node/reg
hexdump -Cv $node/interrupts
nvidia,tegra264-top0-cbb-fabric
top0-cbb-fabric
okay
00000000  00 00 00 81 00 80 00 00  00 00 00 00 00 80 00 00  |................|
00000010
00000000  00 00 00 00 00 00 00 43  00 00 00 04              |.......C....|
0000000c
(base) root@tegra-ubuntu:~#
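
For reference, the raw bytes printed by hexdump can be grouped into 32-bit big-endian cells, which is how device tree properties are stored. This is only a small sketch; dt_cells is a helper name made up for this post, not an NVIDIA tool.

```shell
#!/bin/sh
# dt_cells FILE: print a device tree property as 32-bit big-endian cells.
# (dt_cells is a hypothetical helper name used only for illustration.)
dt_cells() {
    od -An -v -tx1 "$1" |
        tr -s ' ' '\n' |
        awk 'NF { w = w $1; if (++n % 4 == 0) { print "0x" w; w = "" } }'
}
```

Run as dt_cells "$node/reg" and dt_cells "$node/interrupts". For the dump above, reg reads as the cells 0x00000081 0x00800000 0x00000000 0x00800000. If the parent bus uses two address cells and two size cells (an assumption), that is address 0x8100800000 with size 0x800000, which matches the node name. The interrupts cells 0x0 0x43 0x4 would be SPI 67, level-high, if the usual three-cell GIC specifier applies.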

I am not using a COE camera. Mine is set to “okay” (enabled), but I see that it is a module for high-speed data transmission switching and managing exception capture. By default, it is enabled, and I haven’t modified this part of the device tree. Is it disabled on your side? If disabled, will it affect GPU acceleration and the normal functioning of CPU features?

Hi wpceswpces,

@WayneWWW Does Nvidia advise against disabling macsec on mgbe Thor?

@wpceswpces If not, would it make sense to try disabling these in the dtb that ends up in /boot/dtb/, which originates from kernel/dtb/tegra264- ? Change each value from 1 to 0.

source/hardware/nvidia/t264/nv-public/tegra264.dtsi
			nvidia,macsec-enable = <0x0>;
			nvidia,macsec-enable = <0x0>;
			nvidia,macsec-enable = <0x0>;
			nvidia,macsec-enable = <0x0>;
			nvidia,macsec-enable = <0x0>;

If you agree that the following analysis is plausible, you might change the device tree as discussed above:

"nvidia,macsec-enable = <0x0>; "
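
If you would rather make the change in a decompiled dts than by hand, something like the filter below could do it. This is a sketch only: macsec_off is a made-up helper name, and it assumes the property appears on one line as nvidia,macsec-enable = <0x1>; (or <0x01>; with newer dtc output).

```shell
#!/bin/sh
# macsec_off: read dts text on stdin and rewrite every
# "nvidia,macsec-enable = <1>" style value to <0x0>.
# (hypothetical helper, not part of any NVIDIA tooling)
macsec_off() {
    sed -E 's/(nvidia,macsec-enable[[:space:]]*=[[:space:]]*)<(0x)?0*1>/\1<0x0>/'
}
```

Usage would be along the lines of macsec_off < board.dts > board-nomacsec.dts, then diff the two files before recompiling with dtc. Lines that are already <0x0> pass through unchanged.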

The CBB reports are PWRDOWN_ERR on reads to these addresses :
0xa808ade008 = a808a10000.ethernet  mgbe0
0xa808bde008 = a808b10000.ethernet  mgbe1
0xa808dde008 = a808d10000.ethernet  mgbe2
0xa808ede008 = a808e10000.ethernet  mgbe3

Each is base + 0xCE008. That same offset repeating across all MGBE instances points to something touching an MGBE register block at a fixed offset while that block is power-gated / clock-gated, so CBB raises PWRDOWN_ERR.
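
The offset arithmetic can be double-checked in the shell. A minimal sketch (cbb_offset is just a name picked for this post):

```shell
#!/bin/sh
# cbb_offset FAULT_ADDR NODE_BASE: print the offset of a faulting
# address from a node base address, in hex.
cbb_offset() {
    printf '0x%x\n' $(( $1 - $2 ))
}

# The four address/node pairs from the CBB reports above.
cbb_offset 0xa808ade008 0xa808a10000
cbb_offset 0xa808bde008 0xa808b10000
cbb_offset 0xa808dde008 0xa808d10000
cbb_offset 0xa808ede008 0xa808e10000
```

All four calls print 0xce008, which is the repeating offset described above.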

The crash pattern is equivalent across the four Ethernet mgbe interfaces.
[   17.396487] nvethernet a808a10000.ethernet mgbe0_0: Macsec: Reduced MTU: 1466 Max: 9000
[   17.396488] CPU:0, Error: top-cbb-fabric@0x8100800000, irq=16
[   17.397902]        Error Code         : PWRDOWN_ERR
[   17.397911]        Address         : 0xa808ade008
[   17.621438] nvethernet a808b10000.ethernet mgbe1_0: Macsec: Reduced MTU: 1466 Max: 9000
[   17.621441] CPU:0, Error: top-cbb-fabric@0x8100800000, irq=16
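
To pull every CBB error code and address pair out of a longer dmesg capture, a small awk filter may help. This is a sketch that assumes the report keeps the "Error Code : ..." and "Address : ..." line layout shown above; cbb_errors is a made-up name.

```shell
#!/bin/sh
# cbb_errors: read kernel log text on stdin and print one
# "<error code> <address>" line per CBB report.
cbb_errors() {
    awk '/Error Code/ { code = $NF }
         /Address/ && code != "" { print code, $NF; code = "" }'
}
```

For example: dmesg | cbb_errors | sort | uniq -c shows how often each address faults.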

The nvethernet driver is configuring Macsec. It adjusts the MTU to account for Macsec overhead.
Immediately after that, the driver attempts to read a register at address 0xa808ade008.
    0xa808a00000 is the base of MGBE0 (the a808a10000.ethernet node starts 0x10000 above it).
    The offset 0xde008 from that base (0xCE008 from the node address) falls within the MACSEC / Security Engine sub-block of the Ethernet controller.
The hardware responds with PWRDOWN_ERR. This could mean that the MACSEC logic partition within the Jetson Thor SoC is powered down while the driver is trying to talk to it.


After the CBB error (top-cbb-fabric@0x8100800000, irq=16, Error Code: PWRDOWN_ERR) is resolved, you might want to:

a. Ensure the MTU is set before creating bond0 or before the port is brought up.

b. Fix the bond mode mismatch. Does the switch have an LACP LAG configured for the ports?
[ 17.589093] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
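
One way to see whether the other end is answering LACP at all is to look at the partner MAC that the bonding driver reports. A minimal sketch, assuming the usual "Partner Mac Address:" label in /proc/net/bonding/bond0 when the bond is in 802.3ad mode; lacp_partners is a made-up helper name.

```shell
#!/bin/sh
# lacp_partners [FILE]: print the LACP partner MAC address(es) found in
# a bonding status file (default: /proc/net/bonding/bond0).
lacp_partners() {
    awk -F': ' '/Partner Mac Address/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}
```

If this prints 00:00:00:00:00:00, no LACPDUs have been accepted from the link partner, which would be consistent with the warning above. For a direct board-to-board link, both ends would need to run 802.3ad for a partner to show up.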

hi,whitesscott

If I don’t use camera and audio related functionalities, do you know how to cleanly and completely disable the related modules in the device tree?

Is this an issue with COE contention?

If it is related, could you try if this patch helps?

hi wpceswpces,

I think this is it. There are more camera nodes in the dts, but these are all the COE nodes I see.

for n in 0 1 2 3; do
  echo -n "coe$n: "
  tr -d '\0' </sys/firmware/devicetree/base/tegra-capture-coe${n}/status 2>/dev/null || echo "missing"
done

dtc -I dtb -O dts -o board.dts -f /boot/dtb/kernel_tegra264-p4071-0000+p3834-0008-nv.dtb   # change to the dtb file you are using

Edit board.dts and change status = "okay"; to status = "disabled";

        tegra-capture-coe0 {
                compatible = "nvidia,tegra-camrtc-capture-coe";
                nvidia,cam_controller = <0x22b>;
                nvidia,eth_controller = <0x15f>;
                status = "disabled";
        };

        tegra-capture-coe1 {
                compatible = "nvidia,tegra-camrtc-capture-coe";
                nvidia,cam_controller = <0x22b>;
                nvidia,eth_controller = <0x160>;
                status = "disabled";
        };

        tegra-capture-coe2 {
                compatible = "nvidia,tegra-camrtc-capture-coe";
                nvidia,cam_controller = <0x22b>;
                nvidia,eth_controller = <0x235>;
                status = "disabled";
        };

        tegra-capture-coe3 {
                compatible = "nvidia,tegra-camrtc-capture-coe";
                nvidia,cam_controller = <0x22b>;
                nvidia,eth_controller = <0x236>;
                status = "disabled";
        };
dtc -@ -I dts -O dtb -o kernel_tegra264-p4071-0000+p3834-0008-nv.dtb -f board.dts
cp kernel_tegra264-p4071-0000+p3834-0008-nv.dtb rootfs/boot/kernel_tegra264-p4071-0000+p3834-0008-nv.dtb
cp kernel_tegra264-p4071-0000+p3834-0008-nv.dtb bootloader/kernel_tegra264-p4071-0000+p3834-0008-nv.dtb

Flash your T5000 board.

for n in 0 1 2 3; do
  echo -n "coe$n: "
  tr -d '\0' </sys/firmware/devicetree/base/tegra-capture-coe${n}/status 2>/dev/null || echo "missing"
done

edit:


And this should be safe.

sudo tee /etc/modprobe.d/blacklist-tegra-camera.conf >/dev/null <<'EOF'
blacklist tegra_capture_coe
blacklist tegra_capture_isp
blacklist tegra_camera
blacklist tegra_camera_rtcpu
blacklist capture_ivc
blacklist camera_diagnostics
EOF

sudo update-initramfs -u
sudo reboot

Hi,whitesscott

Then how can I make sure the MTU is set before bond0 is created or before the port is brought up? How can I set the MTU in advance? Also, I'd like to ask @WayneWWW whether the CBB issue is related to the point above: "The hardware responds with PWRDOWN_ERR. This could mean that the MACSEC logic partition within the Jetson Thor SoC is powered down, but the driver is trying to talk to it?" Regarding "b. Fix the bond mode mismatch. Does the switch have an LACP LAG configured for the ports?": I am interconnecting two boards directly and have not connected them to a switch yet.

Hi,whitesscott

Thank you, I will try blacklisting the camera-related drivers and give it another shot.

hi wpceswpces

Option 1 (Best practice): systemd-networkd with 4x MGBE → bond0

/etc/systemd/network/20-bond0.netdev

[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
MIIMonitorSec=1s
# Optional tuning:
# LACPTransmitRate=fast
# TransmitHashPolicy=layer3+4

/etc/systemd/network/20-bond0.network

[Match]
Name=bond0

[Link]
MTUBytes=9000

[Network]
# If using static ip.
IPv6AcceptRA=no
BindCarrier=mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0

# IPv4 static
Address=192.168.10.20/24
Gateway=192.168.10.1
DNS=192.168.10.1
DNS=1.1.1.1

# IPv6 static (keep if you are using IPv6)
Address=fd00:10::20/64
Gateway=fd00:10::1
DNS=fd00:10::1

# Optional search domain:
# Domains=lan

/etc/systemd/network/10-mgbe0_0.network

[Match]
Name=mgbe0_0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

/etc/systemd/network/10-mgbe1_0.network

[Match]
Name=mgbe1_0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

/etc/systemd/network/10-mgbe2_0.network

[Match]
Name=mgbe2_0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

/etc/systemd/network/10-mgbe3_0.network

[Match]
Name=mgbe3_0

[Link]
MTUBytes=9000

[Network]
Bond=bond0
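
Since the four per-port files differ only in the interface name, they could also be generated. A sketch under the assumption that your ports are named as they appear in your dmesg (mgbe0_0, mgbe1_0, ...); port_network is a made-up helper name.

```shell
#!/bin/sh
# port_network IFACE [BOND] [MTU]: print a systemd-networkd .network
# stanza that enslaves IFACE to BOND (default bond0) with the given
# MTU (default 9000).
port_network() {
    cat <<EOF
[Match]
Name=$1

[Link]
MTUBytes=${3:-9000}

[Network]
Bond=${2:-bond0}
EOF
}
```

For example: for p in mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0; do port_network "$p" | sudo tee "/etc/systemd/network/10-$p.network" >/dev/null; done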

Apply and verify

sudo systemctl enable --now systemd-networkd
sudo systemctl restart systemd-networkd

networkctl status bond0
ip addr show bond0
ip route show
resolvectl status bond0
cat /proc/net/bonding/bond0


Option 2: NetworkManager (nmcli) with 4× MGBE into bond0

sudo nmcli con add type bond ifname bond0 con-name bond0 mode 802.3ad
sudo nmcli con modify bond0 802-3-ethernet.mtu 9000
sudo nmcli con modify bond0 bond.options "mode=802.3ad,miimon=100,lacp_rate=fast,xmit_hash_policy=layer3+4"

for i in 0 1 2 3; do
  sudo nmcli con add type ethernet ifname mgbe${i}_0 con-name bond0-mgbe${i}_0 master bond0
  sudo nmcli con modify bond0-mgbe${i}_0 802-3-ethernet.mtu 9000
done

sudo nmcli con up bond0


# Verify
nmcli -f NAME,TYPE,DEVICE con show --active
ip -d link show bond0
cat /sys/class/net/bond0/mtu
for d in mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0; do cat /sys/class/net/$d/mtu; done

Hi,whitesscott

I’m using NetworkManager, and I noticed something from what you just said. My 4 MGBE interfaces have an MTU of 1466, but after bonding, it defaults to 1500. I didn’t set this, and I’m not sure if it has any impact.

hi wpceswpces,

Pick an MTU that will work end to end. I saw a @WayneWWW post that said to:
down bond, change mtu, up bond

9000 = the most widely supported jumbo value, provided every device on the path can use it.
9216 = common on some switches/NICs (9K plus headers), but not universal.
1500 = always works, but leaves performance on the table at high rates.

If needed, get connection names.
nmcli -f DEVICE,TYPE,STATE,CONNECTION device status

For testing a direct Thor-to-Thor connection with no switch, set 9000 on both ends.

sudo nmcli connection down "bond0"

sudo nmcli connection modify "bond0"   802-3-ethernet.mtu 9000
sudo nmcli connection modify "mgbe0_0" 802-3-ethernet.mtu 9000
sudo nmcli connection modify "mgbe1_0" 802-3-ethernet.mtu 9000
sudo nmcli connection modify "mgbe2_0" 802-3-ethernet.mtu 9000
sudo nmcli connection modify "mgbe3_0" 802-3-ethernet.mtu 9000

sudo nmcli connection up "bond0"

Verify the MTU:

ip link show bond0
ip link show mgbe0_0   # repeat for mgbe1_0, mgbe2_0, mgbe3_0

Confirm jumbo frames really work Thor to Thor:
ping -c 3 -M do -s 8972 thor2_ip_address
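
The 8972 comes from the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A tiny sketch for other MTU values (ping_payload is a made-up name; it assumes IPv4 with no IP options):

```shell
#!/bin/sh
# ping_payload MTU: largest ICMP echo payload that fits in one
# unfragmented IPv4 packet of the given MTU (20-byte IP + 8-byte ICMP).
ping_payload() {
    echo $(( $1 - 20 - 8 ))
}
```

For example: ping -c 3 -M do -s "$(ping_payload 9000)" thor2_ip_address. With -M do the packet may not be fragmented, so the ping fails if any hop has a smaller MTU.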

Hi,WayneWWW

I have some good news and some bad news. I just recompiled the OOT (out-of-tree) driver with the patch you provided, and the CBB error seems to be gone. The bad news is that the RCU issue is still there. I also found that when the bond is linked over fiber at 100G, running netplan apply on the system triggers the RCU deadlock as well. Additionally, following @whitesscott’s suggestion, I blacklisted all the camera-related drivers, but the RCU issue still persists. After the system locks up, unplugging one end of the fiber optic cable quickly restores it. What should I check next? I feel like it might still be a bug in the nvethernet driver.

(base) root@tegra-ubuntu:~# cat /etc/modprobe.d/blacklist-tegra-camera.conf
blacklist tegra_capture_coe
blacklist tegra_capture_isp
blacklist tegra_camera
blacklist tegra_camera_rtcpu
blacklist capture_ivc
blacklist camera_diagnostics
blacklist nvhost_nvcsi
blacklist nvhost_vi5
blacklist v4l2_dv_timings
blacklist tegra_camera_platform
blacklist nvhost_isp5
blacklist v4l2_fwnode
blacklist v4l2_async
blacklist videobuf2_dma_contig
blacklist videobuf2_v4l2
blacklist videodev
blacklist videobuf2_common
blacklist host1x_nvhost
blacklist host1x_fence
blacklist tegra_se
blacklist ivc_bus
blacklist hsp_mailbox_client
blacklist nvhost_capture
blacklist host1x
blacklist rtcpu_debug
(base) root@tegra-ubuntu:~# lsmod |grep camera
(base) root@tegra-ubuntu:~# lsmod |grep cap
nvpmodel_clk_cap       12288  0
(base) root@tegra-ubuntu:~# lsmod |grep isp
drm_display_helper    172032  1 tegra_drm
drm_kms_helper        208896  3 drm_display_helper,tegra_drm,nvidia_drm
drm                   602112  17 drm_kms_helper,drm_display_helper,nvidia,tegra_drm,nvidia_drm
(base) root@tegra-ubuntu:~# lsmod |grep vi
nvidia_drm            114688  6
nvidia_modeset       1826816  6 nvidia_drm
nvidia_uvm           1490944  0
nvidia              14798848  86 nvidia_uvm,nvidia_modeset
host1x                180224  5 nvidia,tegra_drm,nvidia_drm,nvhost_pva,nvidia_modeset
mc_utils               12288  1 nvidia
drm_kms_helper        208896  3 drm_display_helper,tegra_drm,nvidia_drm
nvidia_vrs_pseq        12288  0
tegra_dce             126976  2 nvidia
nvidia_cspmu           49152  0
arm_cspmu_module       20480  1 nvidia_cspmu
drm                   602112  17 drm_kms_helper,drm_display_helper,nvidia,tegra_drm,nvidia_drm
(base) root@tegra-ubuntu:~# lsmod |grep coe
(base) root@tegra-ubuntu:~#
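
Note that a blacklist entry only stops a module from being autoloaded through its aliases; it can still be pulled in as a dependency of another module or by an explicit modprobe, which is probably why host1x still shows up in the lsmod output above. A sketch for listing blacklisted modules that are loaded anyway (still_loaded is a made-up helper name):

```shell
#!/bin/sh
# still_loaded CONF [MODULES_FILE]: print modules that are named in
# "blacklist" lines of CONF but are nevertheless present in
# MODULES_FILE (default: /proc/modules).
still_loaded() {
    awk 'NR == FNR { if ($1 == "blacklist") bl[$2] = 1; next }
         ($1 in bl) { print $1 }' "$1" "${2:-/proc/modules}"
}
```

For example: still_loaded /etc/modprobe.d/blacklist-tegra-camera.conf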

hi wpceswpces,

These may be unsafe to blacklist: host1x, host1x_nvhost, host1x_fence, ivc_bus, hsp_mailbox_client

Okay, I will remove these.

Hi,WayneWWW

According to the configuration in the static YAML file above, when linking to 100G, continuously running netplan apply will likely reproduce the issue. The logs are as follows:

dmesg.txt (14.9 KB)

(base) root@tegra-ubuntu:~#
(base) root@tegra-ubuntu:~# ethtool bond0
Settings for bond0:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes
(base) root@tegra-ubuntu:~# netplan apply
(base) root@tegra-ubuntu:~#
(base) root@tegra-ubuntu:~#
(base) root@tegra-ubuntu:~# netplan apply





[  848.573334] rcu: INFO: rcu_preempt self-detected stall on CPU
[  848.573338] rcu:     4-....: (5249 ticks this GP) idle=e87c/1/0x4000000000000000 softirq=7225/7225 fqs=2624
[  848.573343] rcu:     (t=5250 jiffies g=20289 q=2790 ncpus=14)






[  873.445206] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-.... } 5428 jiffies s: 2017 root: 0x10/.
[  873.445218] rcu: blocking rcu_node structures (internal RCU debug):
[  911.585056] rcu: INFO: rcu_preempt self-detected stall on CPU
[  911.585058] rcu:     4-....: (21002 ticks this GP) idle=e87c/1/0x4000000000000000 softirq=7225/7225 fqs=10499
[  911.585062] rcu:     (t=21003 jiffies g=20289 q=5508 ncpus=14)
[  936.932921] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-.... } 21300 jiffies s: 2017 root: 0x10/.
[  936.932933] rcu: blocking rcu_node structures (internal RCU debug):
[  967.652802] INFO: task khugepaged:108 blocked for more than 120 seconds.
[  967.652835]       Tainted: G           OE      6.8.12-tegra #1
[  967.652854] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  967.659421] INFO: task kworker/u41:2:2959 blocked for more than 120 seconds.
[  967.666195]       Tainted: G           OE      6.8.12-tegra #1
[  967.679934] INFO: task NetworkManager:3703 blocked for more than 120 seconds.
[  967.692734] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  974.596762] rcu: INFO: rcu_preempt self-detected stall on CPU
[  974.596764] rcu:     4-....: (36755 ticks this GP) idle=e87c/1/0x4000000000000000 softirq=7225/7225 fqs=18375
[  974.596766] rcu:     (t=36756 jiffies g=20289 q=7911 ncpus=14)
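
To get a feel for how long CPU 4 has been stuck, the t= values can be converted to seconds. The sketch below assumes HZ=250, which is inferred from the log itself: t grows by 15753 jiffies (5250 to 21003) while the timestamps advance about 63 s. rcu_stalls is a made-up name.

```shell
#!/bin/sh
# rcu_stalls [HZ]: read kernel log text on stdin and print each RCU
# stall duration in jiffies and seconds (HZ defaults to 250, assumed).
rcu_stalls() {
    awk -v hz="${1:-250}" 'match($0, /\(t=[0-9]+ jiffies/) {
        t = substr($0, RSTART + 3, RLENGTH - 11)
        printf "%s jiffies = %.1f s\n", t, t / hz
    }'
}
```

For example: dmesg | rcu_stalls. Since g=20289 is the same in all three reports above, they all refer to one grace period that never completes, so this is a single long stall rather than three separate ones.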


hi wpceswpces,

Try this if it’s not how you are already running netplan apply.

ip link set bond0 down

for i in mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0; do ip link set "$i" down; done

netplan apply

for i in mgbe0_0 mgbe1_0 mgbe2_0 mgbe3_0; do ip link set "$i" up; done

ip link set bond0 up

The following backtraces may show which function NetworkManager (or the kworker) is stuck in.

Enable magic SysRq (once)
sudo sysctl -w kernel.sysrq=1

When the system is “hung”, dump task states + stacks
echo w | sudo tee /proc/sysrq-trigger # blocked tasks
echo t | sudo tee /proc/sysrq-trigger # full task backtraces

Then grab/post
dmesg -T | tail -n 400
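
To quickly list which tasks the hung-task detector flagged before digging through the full backtraces, a small filter like this may help. It is a sketch that assumes the "INFO: task NAME:PID blocked for more than" wording seen in your log; hung_tasks is a made-up name.

```shell
#!/bin/sh
# hung_tasks: read kernel log text on stdin and print the NAME:PID of
# every task reported as blocked by the hung-task detector.
hung_tasks() {
    awk '/INFO: task .* blocked for more than/ {
        for (i = 1; i < NF; i++) if ($i == "task") { print $(i + 1); break }
    }'
}
```

For example: dmesg | hung_tasks | sort -u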

hello,whitesscott

Did I capture the lower-level stack information this way? When the RCU stall occurred, I unplugged the fiber, and I ran dmesg only after the system recovered.

COM13 (USB Serial Port (COM13))_2026-02-04-165417.log (83.2 KB)

COM6 (USB Serial Port (COM6))_2026-02-04-165316.log (53.7 KB)

Could it be that there’s a bug in how nvethernet handles interrupts? Can the stack trace be specific enough to pinpoint the exact function and line where it is stuck?

hi wpceswpces,

On your Thor, the driver/kernel is effectively pinning all MGBE IRQs to CPU 4 via affinity_hint and the resulting effective_affinity.
That may create:
    softirq pressure on one CPU
    long non-preemptible sections in networking paths
    RCU stall warnings on that CPU when things get busy (link flap / bonding / speed change storms)

Your dmesg stall stack showed CPU4 stuck in handle_softirqs while a worker was in set_speed_work_func [nvethernet]. Pinning interrupts away from CPU4 is a potential mitigation because it reduces the chance that CPU4 gets saturated by NIC softirq + link-event work at the same time.

This won’t fix a driver bug, but it can turn a “stall” into “no stall” if the root cause is one CPU getting overwhelmed during link flaps / speed-set work / bonding transitions.
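
If you want to inspect the affinity by hand before running any script, the IRQ numbers can be picked out of /proc/interrupts first. A minimal sketch: irqs_matching is a made-up helper, and the pattern to match (for example "ethernet" or "mgbe") is an assumption, since it depends on how nvethernet names its IRQs on your system.

```shell
#!/bin/sh
# irqs_matching PATTERN [FILE]: print the IRQ numbers of all lines in
# FILE (default: /proc/interrupts) that match PATTERN.
irqs_matching() {
    awk -v pat="$1" '$0 ~ pat { sub(/:$/, "", $1); print $1 }' "${2:-/proc/interrupts}"
}
```

For example: for irq in $(irqs_matching ethernet); do echo "$irq: $(cat /proc/irq/$irq/smp_affinity_list)"; done. Writing a CPU list to /proc/irq/N/smp_affinity_list as root moves that IRQ; which CPUs to choose is up to you.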

Currently the script prints the configuration after it runs. Hope it helps.

Save attachment to cpuaffinity.sh

sudo ./cpuaffinity.sh 

cpuaffinity.sh.txt (7.8 KB)



If it helps, it can be made less chatty and turned into a service. You may then also want to:


Disable autonegotiation:

ethtool -s mgbe0_0 speed 25000 duplex full autoneg off

and repeat for mgbe1_0, mgbe2_0, mgbe3_0


Edit /boot/extlinux/extlinux.conf
Add threadirqs to the APPEND line.
Reboot
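
After the reboot you can confirm that the parameter actually reached the kernel. A small sketch (has_boot_param is a made-up helper name):

```shell
#!/bin/sh
# has_boot_param NAME [FILE]: succeed if NAME appears as a standalone
# word on the kernel command line (default file: /proc/cmdline).
has_boot_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qx -- "$1"
}
```

For example: has_boot_param threadirqs && echo "threadirqs is active"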

Hi,WayneWWW

Based on yesterday’s findings, applying netplan apply to reconfigure the network card will trigger the issue. I did a test on the devkit and found that the same rcu issue can be reproduced. Could you try it on your side as well? @WayneWWW @whitesscott The steps are as follows:

  1. Connect the two devkits via optical modules and fiber optics.
  2. Use the YAML file I provided above, recompile the kernel with bond functionality, configure both devices to be on the same subnet but with different IPs and MAC addresses, and bond them to 100G.
  3. Both devices should frequently execute netplan apply at the same time. Run the following command on both devices:
    while true; do netplan apply; sleep 1; done
    The rcu deadlock and reboot issue can be reproduced very quickly. Below is the serial console log from my side:
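
For a reproduction that is easier to line up against the serial logs, a bounded loop that prints a timestamp and iteration number could be used instead of the bare while loop. This is only a sketch; repro_loop is a made-up name.

```shell
#!/bin/sh
# repro_loop COUNT DELAY CMD...: run CMD up to COUNT times, DELAY
# seconds apart, printing a timestamp and iteration number before each
# run. Stops early if CMD fails.
repro_loop() {
    local count=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$count" ]; do
        printf '%s iteration %d\n' "$(date +%T)" "$i"
        "$@" || return 1
        sleep "$delay"
        i=$((i + 1))
    done
}
```

For example, on both devices: repro_loop 300 1 netplan apply. The last iteration printed on the console before the stall tells you how many applies it took.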

COM69 (USB 串行设备 (COM69))_2026-02-05-093025.log (487.4 KB)

COM26 (USB 串行设备 (COM26))_2026-02-05-093036.log (50.8 KB)

Please keep me updated on any progress. Thank you very much! Additionally, I need to add that I am using the default configuration and have not enabled the threaded feature for MGBE.

Also, I just tested enabling the 4x MGBE threaded feature, and the issue still occurs. When both devices are stuck at the same time, one device recovers after a 120s watchdog reboot and network disconnection, which also causes the other device to recover. The logs are as follows:

COM26 (USB 串行设备 (COM26))_2026-02-05-095631.log (522.4 KB)

COM69 (USB 串行设备 (COM69))_2026-02-05-095655.log (564.4 KB)

Please give us a step-by-step method first. Thanks.

For example, exactly which configs did you enable when you “recompile the kernel with bond functionality”?

Where did you put your YAML file, and how did you make it take effect?

Hi,WayneWWW

Please refer to the above. To enable bonding in the kernel, you only need to set CONFIG_BONDING=m to get the bonding.ko module.

root@tegra-ubuntu:~# lsmod |grep bond
bonding               147456  0
ipv6                  442368  76 bridge,bonding
root@tegra-ubuntu:~#
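
To confirm the option on the running kernel rather than in the build tree, the config can be queried directly. A sketch, assuming the kernel exposes its config through /proc/config.gz (CONFIG_IKCONFIG_PROC) or ships a /boot/config-* file; kconfig_value is a made-up helper name.

```shell
#!/bin/sh
# kconfig_value NAME: read a kernel .config on stdin and print the
# value assigned to NAME (prints nothing if NAME is unset).
kconfig_value() {
    sed -n "s/^$1=//p"
}
```

For example: zcat /proc/config.gz | kconfig_value CONFIG_BONDING should print m if bonding was built as a module.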

The YAML file should be placed under /etc/netplan/, and NetworkManager will take control.

02-network-manager-all.yaml.txt (664 Bytes)

  1. Connect the two devkits via optical modules and fiber optics.

  2. Use the YAML file I provided above (remove the .txt extension from the file). Recompile the kernel with CONFIG_BONDING=m, or just build this module, and modprobe bonding.ko. Configure both devices to be on the same subnet but with different IPs and MAC addresses, and bond them to 100G in the YAML file, for example:

      addresses:
        - 192.168.139.44/24
      macaddress: "00:11:22:33:44:55"

     They can then ping each other successfully over the 100G link.
root@tegra-ubuntu:~# ethtool bond0
Settings for bond0:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes
root@tegra-ubuntu:~#

  3. Both devices should run netplan apply frequently at the same time. Run the following command on both devices:
    while true; do netplan apply; sleep 1; done

  4. The error then appears within 5 minutes.

COM69 (USB 串行设备 (COM69))_2026-02-05-101950.log (46.3 KB)

COM26 (USB 串行设备 (COM26))_2026-02-05-103936.log (42.9 KB)

Please keep me updated on any progress. Thank you very much! If you have any issues, feel free to contact me anytime. Looking forward to your good news.