Rcu: INFO: rcu_preempt self-detected stall on CPU, Unable to access the system, System freeze

Hi wpceswpces,

To restate your conclusions: there is still a repeated LINK_OK storm that triggers restart_lane_bringup(ENABLE) while lane_status=1.

As noted, there is a second storm source with ls=0x3000000, which the code currently treats as “other/ignored”. Let’s treat ls=0x3000000 as an abnormal or combined LS state that deserves explicit logging/handling, because right now it is arriving at storm rates and being ignored.

Restart requests are arriving while the tasklet is already pending, which may indicate a missing dedup/containment. The RCU stall remains a downstream effect of the resulting tasklet/softirq flood. Spreading IRQ affinity helped load distribution, but the logs show it is not the root cause.
The root problem still looks like status/interrupt churn in the MAC link-state path.

Here are two small troubleshooting patches targeting those issues. They:

  1. Explicitly name ls=0x3000000 as a combined/abnormal LS state.

  2. Suppress restart_lane_bringup(ENABLE) when the lane-restart tasklet is already scheduled.

    osd.c_Replacment_function.txt (1.1 KB)

    mgbe_core.c_Replacment_function.txt (5.4 KB)
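
To illustrate the second item, here is a minimal userspace sketch of a "schedule once while pending" guard. It is only a model, not the code in the attached files: struct restart_guard, restart_request() and restart_handler_done() are hypothetical names, and a C11 atomic_flag stands in for the tasklet's scheduled state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a restart path that refuses to queue a second
 * restart while one is still pending. Not taken from the driver. */
struct restart_guard {
	atomic_flag pending;      /* set while a restart is queued */
	unsigned int scheduled;   /* requests that were queued */
	unsigned int suppressed;  /* requests dropped as duplicates */
};

static void restart_guard_init(struct restart_guard *g)
{
	atomic_flag_clear(&g->pending);
	g->scheduled = 0;
	g->suppressed = 0;
}

/* Returns true if the caller should queue the restart, false if one is
 * already pending. The counters are plain integers for illustration,
 * so this model is only meant to be driven from a single thread. */
static bool restart_request(struct restart_guard *g)
{
	if (atomic_flag_test_and_set(&g->pending)) {
		g->suppressed++;
		return false;
	}
	g->scheduled++;
	return true;
}

/* Called by the (modelled) handler once the restart has run. */
static void restart_handler_done(struct restart_guard *g)
{
	atomic_flag_clear(&g->pending);
}
```

Under a storm, every request after the first returns false until restart_handler_done() runs, so the amount of deferred work is bounded by how fast the handler completes rather than by the interrupt rate.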

Hi wpceswpces,

Here’s a small guard added to osd.c that is narrowly focused on the “disable → restart already pending” case, so that osd_restart_lane_bringup() does not keep re-scheduling the tasklet. It is experimental because pdata->tx_start_stop is only a single state variable, so this patch may help with duplicate disable storms but does not solve all ordering races between disable and enable.

It skips re-scheduling the tasklet when all of these are true:

  1. incoming en_disable matches current pdata->tx_start_stop

  2. the request is OSI_DISABLE

  3. set_speed_work is already pending

    osd.c.txt (28.6 KB)
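
The three conditions above can be written as a single predicate. This is a sketch for discussion, not the code in the attachment: should_skip_reschedule() is a hypothetical helper, and MODEL_DISABLE/MODEL_ENABLE are stand-ins whose values (0 and 1) I am assuming for OSI_DISABLE/OSI_ENABLE.

```c
#include <stdbool.h>

/* Assumed stand-ins for OSI_DISABLE / OSI_ENABLE; values are assumptions. */
#define MODEL_DISABLE 0U
#define MODEL_ENABLE  1U

/* Returns true only when all three conditions hold:
 *   1. the incoming request matches the stored state,
 *   2. the request is a disable,
 *   3. the delayed speed work is already pending.
 * In every other case the caller schedules the tasklet as before. */
static bool should_skip_reschedule(unsigned int en_disable,
				   unsigned int tx_start_stop,
				   bool set_speed_work_pending)
{
	return en_disable == tx_start_stop &&
	       en_disable == MODEL_DISABLE &&
	       set_speed_work_pending;
}
```

Because the check is deliberately narrow, an enable request or a disable that arrives with no speed work pending still goes through the normal path.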



You may have already done this. Here’s a script to move each mgbe IRQ to its own CPU.

mgbe-4-irq.sh.txt (881 Bytes)


If I have followed the code correctly, the 25G recovery path in nvethernet appears to have a weakly-coalesced restart design: fault events schedule a high-priority tasklet, the tasklet schedules delayed speed work, and I do not see an explicit mod_delayed_work()/pending-style guard around that restart path. In a 4×25 bonded setup, that could plausibly contribute to restart churn, delayed recovery, and the RCU-stall symptoms you are seeing.

25G bringup: The driver eventually runs delayed set_speed_work to set the port back to 25G and bring carrier back once the RM SET_SPEED path succeeds.

25G fault recovery: When the MAC reports a fault, the RM layer calls into Linux, Linux stores a disable/enable state, schedules a high-priority tasklet, and the tasklet either stops TX and schedules delayed speed recovery or restarts TX. This restart path does not appear to use an explicit “pending + mod_delayed_work” coalescing pattern like the one in the following RCU code:

kernel/kernel-noble/kernel/rcu/tree.c

static void
__schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
{
	long delay, delay_left;

	delay = krc_count(krcp) >= KVFREE_BULK_MAX_ENTR ? 1:KFREE_DRAIN_JIFFIES;
	if (delayed_work_pending(&krcp->monitor_work)) {
		delay_left = krcp->monitor_work.timer.expires - jiffies;
		if (delay < delay_left)
			mod_delayed_work(system_wq, &krcp->monitor_work, delay);
		return;
	}
	queue_delayed_work(system_wq, &krcp->monitor_work, delay);
}

Hi whitesscott,

Thanks for your detailed analysis and response. I have implemented several optimizations to mitigate RCU issues caused by interrupt storms:

  1. Added nvidia,common_irq-cpu-id in the device tree to separate the common interrupts.

  2. Added a flag variable in ether_priv_data to prevent repeated calls to restart_lane_bringup when the link is already up.

  3. Downgraded the high-priority tasklet tasklet_hi_schedule(&pdata->lane_restart_task) to a workqueue using schedule_work(&pdata->lane_restart_work) to avoid non-preemptible execution.

Could you check if the approaches above might help mitigate this issue?

Hi wpceswpces,

Yes, these look like excellent mitigation steps for an RCU-stall issue driven by excessive interrupt and deferred work activity.

Separating the common IRQs with nvidia,common_irq-cpu-id may help reduce per-CPU saturation if too much common-interrupt load was landing on one core. Adding a guard to avoid repeated restart_lane_bringup calls when the link is already up also makes sense, because it could reduce self-generated restart churn and interrupt amplification. Changing the restart path from tasklet_hi_schedule() to schedule_work() is likewise a good experiment, since moving that work out of high-priority softirq context should reduce long non-preemptible execution and may lower the chance of triggering RCU stalls.

They could reduce the severity or frequency of the stalls, especially if the immediate problem is CPU starvation from repeated interrupt and restart activity. At the same time, they do not by themselves prove the root cause, so the key follow up will be whether they materially reduce stall frequency, interrupt growth, and the number of restart events without creating new link-recovery issues.

Hi @wpceswpces

We tried to debug with the scripts you shared.

However, we noticed that the issue reproduced by the scripts might not occur in real-world use, so fixing it might not fix the issue you are actually hitting.

What I mean is that the issue we reproduced on our side might not be the same issue that happens on your side.

Is your use case just “reboot while bonding is already active”? Does only one device reboot, or do both devices reboot at the same time?

Hi WayneWWW,


We initially found that when two interconnected devices are powered on at the same time (that is, manually applying power to both while the cables are already connected), neither of them could boot into the system.
I also noticed that if two interconnected devices are already running and I reboot one of them, it may fail to log in properly during the startup process.
Similarly, if one of the interconnected devices is powered off and then restarted, there’s a chance that it will hang at the end of boot and fail to log in.

When the system gets stuck, unplugging one MGBE connection — whether transmit or receive — immediately allows the system to recover from the RCU state. So, this issue appears to be strongly related to MGBE.

If bonding is not configured, the issue doesn’t seem to occur, but enabling bonding makes it worse. Since the system performs network-related configuration (like up, bonding, and IP setup) near the end of boot, which is managed and triggered by NetworkManager, we used repeated triggers under the system environment to reproduce it.

On the devkit, when repeatedly reproducing the issue, unplugging the QSFP optical module can also immediately recover the system.

When you mentioned that this might not be the same issue, was that based on differences in the dmesg stack analysis?

We initially found that when two interconnected devices are powered on at the same time (that is, manually applying power to both while the cables are already connected), neither of them could boot into the system.
I also noticed that if two interconnected devices are already running and I reboot one of them, it may fail to log in properly during the startup process.
Similarly, if one of the interconnected devices is powered off and then restarted, there’s a chance that it will hang at the end of boot and fail to log in.

I want to clarify: do all of the above cases lead to the rcu_preempt error, or are other issues mixed in as well?

I only want to discuss the rcu_preempt error here. Also, are all of these cases with bonding enabled and a static IP already configured?

Hi WayneWWW,

Yes, all the abnormal cases I mentioned are RCU errors — no other issues are mixed in.
When the system fails to log in, we can see it is stuck in an RCU state.
If we unplug the MGBE connection, the login prompt appears immediately.
Bonding and static IP are both enabled in all these cases.

In addition, this issue occurs in the setup where two Thor boards are directly connected to each other. We once connected one Thor to a router with the same configuration, and the RCU issue did not appear. I’m not completely sure, but I suspect the difference might be related to the bring-up or negotiation process between two Thors with identical bonding configurations, compared to the case where one Thor is connected to a router.

But I can’t say for sure — it’s certain that there’s no problem when connected to the router.

Hi,

Let me ask the question another way.

If you reboot only one device and always leave the other side untouched, will you hit this error?

Hi WayneWWW,

Yes, it will happen, but not every time. When both interconnected devices are powered on and have entered the system, the link comes up at 100G and can be pinged. Then, if I reboot one of them through the serial port or SSH, the issue occurs. I tested this on a custom board.

To confirm my understanding.

You have 2 devices connected. Both sides have bonding enabled and static IP configured.

If you keep rebooting one of the devices, the rcu_preempt error happens intermittently.
When the issue happens, unplugging the fiber cable and reconnecting it recovers the system.

If the above is correct, could you share the UART log for this error (including the part where the system recovers)?

Hi WayneWWW,

Yes. I’d like to confirm whether reproducing by either rebooting or power cycling would be fine. I’ll switch back to the original nvethernet.ko a bit later and then collect the logs again.

I think that is also one of the things we want to test.

I mean whether a reboot versus a cold-boot power cycle makes any difference to the situation.

Hi WayneWWW,

This is the UART log and dmesg that I just reproduced on one device by rebooting only. After the RCU issue occurred, the system froze for a long time. I then disconnected one MGBE (RX/TX), which immediately restored the system. Once it booted into the OS, I could see that the link came up on three channels (75G), and then I exported the dmesg. However, I believe the issue has occurred both after reboot and after power cycling.

COM13 (USB Serial Port (COM13))_2026-03-25-153838.log (524.8 KB)

dmesg_0325.txt (130.0 KB)

Hi WayneWWW,

This is the log I reproduced by repeatedly plugging and unplugging the power on one custom board. After the RCU issue occurred, the login prompt only appeared once I disconnected one MGBE channel. The input was still somewhat laggy, but after disconnecting another MGBE (the board is now linked at 50G), the system fully recovered. Then I collected the dmesg information.

dmesg_power_cycle.txt (130.2 KB)

COM13 (USB Serial Port (COM13))_2026-03-25-155714.log (965.6 KB)

Also, the number of common IRQs is quite large, on the order of millions, and just as whitesscott mentioned above, they are all concentrated on CPU4.

Hi wpceswpces,

Two more things that might contribute to the problem:

  1. xpcs_check_pcs_lock_status() uses RETRY_ONCE (1 retry = 2ms total) to wait for PCS block lock after 25G lane bringup. On a direct-attach fiber link this is far too short. The function returns -1 immediately, set_speed_work_func() reschedules itself after 1 second, and with four interfaces all failing simultaneously the workqueue is flooded at 4 retries/second indefinitely. The kworkers accumulate in D-state and stall RCU grace periods.

  2. The static-1000ms retry interval in set_speed_work_func() provides no backoff, so a persistent link failure (e.g. while waiting for the remote board to come up) keeps hammering the workqueue.

Modification 1. Extend xpcs_check_pcs_lock_status() to allow 100ms for PCS block lock when operating in 25G mode.
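
As a rough illustration of why the poll budget matters, here is a small userspace model. Everything in it is hypothetical: struct sim_pcs and wait_for_block_lock() are not driver functions, each poll is assumed to cost about 1ms, and the 20-poll lock time is an arbitrary example rather than a measurement.

```c
#include <stdbool.h>

/* Simulated PCS: reports "not locked" for the first lock_after_polls
 * polls, then "locked" from the next poll onward. */
struct sim_pcs {
	unsigned int lock_after_polls;
	unsigned int polls;   /* polls issued so far */
};

static bool sim_pcs_locked(struct sim_pcs *p)
{
	p->polls++;
	return p->polls > p->lock_after_polls;
}

/* Poll up to max_polls times. Returns 0 once lock is seen, -1 if the
 * budget runs out first. A real driver would delay between polls;
 * the model assumes roughly 1ms per poll and does not sleep. */
static int wait_for_block_lock(struct sim_pcs *p, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (sim_pcs_locked(p))
			return 0;
	}
	return -1;
}
```

With a link that needs 20 polls to lock, a budget of 2 returns -1 and hands the problem to the retry path, while a budget of 100 returns 0 on the 21st poll within a single call.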

Modification 2. Replace fixed 1000ms retry with exponential backoff (1s → 2s → 4s → 8s → 16s → 30s capped). Reset the retry counter on success or whenever a fresh link event cancels the pending work.
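
Here is the backoff schedule as a standalone userspace sketch, so the numbers can be checked outside the driver. backoff_delay_ms() mirrors the arithmetic in the patch; attempts_in_window() and the BACKOFF_* names are hypothetical helpers added only for the comparison.

```c
#include <stdint.h>

#define BACKOFF_BASE_MS   1000U
#define BACKOFF_CAP_MS    30000U
#define BACKOFF_MAX_SHIFT 5U   /* from the 6th retry on, use the cap */

/* Delay before the next retry: 1s, 2s, 4s, 8s, 16s, then 30s capped. */
static uint32_t backoff_delay_ms(uint32_t retry_cnt)
{
	if (retry_cnt >= BACKOFF_MAX_SHIFT)
		return BACKOFF_CAP_MS;
	return BACKOFF_BASE_MS << retry_cnt;
}

/* Count attempts started within window_ms for one interface whose
 * link never comes up, with the first attempt at t = 0. Intended for
 * small windows; t is not protected against 32-bit overflow. */
static uint32_t attempts_in_window(uint32_t window_ms, int use_backoff)
{
	uint32_t t = 0, attempts = 0, retry = 0;

	while (t < window_ms) {
		attempts++;
		t += use_backoff ? backoff_delay_ms(retry) : BACKOFF_BASE_MS;
		retry++;
	}
	return attempts;
}
```

For a 60-second window this gives 60 attempts per interface with the fixed 1000ms interval and 6 with backoff, so four failing interfaces drop from 240 attempts to 24 in the first minute.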

Note: another potential cause of error is that xlgpcs_init() has its full XLGPCS mode-selection sequence guarded by “#if 0 //FIXME”. That requires the correct T26X hardware sequence from NVIDIA and is not addressed here.



osd.c.txt (27.1 KB)

ether_linux.h.txt (28.0 KB)

ether_linux.c.txt (221.6 KB)

# Copy the attached files to:

cp ether_linux.* source/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/
cp osd.c source/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/

Here’s the patch. It failed to apply, which is why I attached the full files, but it does show the edits.

diff --git a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/nvethernetrm/osi/core/xpcs.c b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/nvethernetrm/osi/core/xpcs.c
--- a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/nvethernetrm/osi/core/xpcs.c
+++ b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/nvethernetrm/osi/core/xpcs.c
@@ -546,6 +546,13 @@ static nve32_t xpcs_check_pcs_lock_status(struct osi_core_priv_data *osi_core)
 			};
 
+	/* 25G fiber/DAC requires more time for PCS block lock than the
+	 * 1ms HW-team figure (measured for 10G). Allow up to 100ms.
+	 */
+	if (osi_core->uphy_gbe_mode == OSI_GBE_MODE_25G)
+		retry = 100U;
+
 	count = 0;
 	while (cond == COND_NOT_MET) {
 

diff --git a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.h b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.h
--- a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.h
+++ b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.h
@@ -733,6 +733,8 @@ struct ether_priv_data {
 	/** Ref count for set_speed_work_func */
 	atomic_t set_speed_ref_cnt;
+	/** Retry counter for set_speed_work_func exponential backoff */
+	unsigned int set_speed_retry_cnt;
 	/** flag to enable logs using ethtool */
 	u32 msg_enable;
 

diff --git a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.c b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.c
--- a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.c
+++ b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/ether_linux.c
@@ -1306,10 +1306,23 @@ void set_speed_work_func(struct work_struct *work)
 	ret = osi_handle_ioctl(pdata->osi_core, &ioctl_data);
 	if (ret < 0) {
-		netdev_dbg(dev, "Retry set speed\n");
+		unsigned int delay_ms;
+
+		/* Exponential backoff: 1s, 2s, 4s, 8s, 16s, 30s (capped).
+		 * Prevents workqueue flooding when lane bringup fails on
+		 * all four bonded MGBE interfaces simultaneously.
+		 */
+		if (pdata->set_speed_retry_cnt >= 5U)
+			delay_ms = 30000U;
+		else
+			delay_ms = 1000U << pdata->set_speed_retry_cnt;
+
+		pdata->set_speed_retry_cnt++;
+		netdev_dbg(dev, "Retry set speed in %ums (attempt %u)\n",
+			   delay_ms, pdata->set_speed_retry_cnt);
 		schedule_delayed_work(&pdata->set_speed_work,
-				      msecs_to_jiffies(1000));
+				      msecs_to_jiffies(delay_ms));
 		atomic_set(&pdata->set_speed_ref_cnt, OSI_DISABLE);
 		return;
 	}
@@ -1346,6 +1359,7 @@ void set_speed_work_func(struct work_struct *work)
 	netif_carrier_on(dev);
 
+	pdata->set_speed_retry_cnt = 0;
 	atomic_set(&pdata->set_speed_ref_cnt, OSI_DISABLE);
 }
 
@@ -1411,6 +1425,7 @@ static void ether_adjust_link(struct net_device *dev)
 
 	cancel_delayed_work_sync(&pdata->set_speed_work);
+	pdata->set_speed_retry_cnt = 0;
 	if (phydev->link) {
 
 		if (phydev->speed != pdata->speed) {

diff --git a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/osd.c b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/osd.c
--- a/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/osd.c
+++ b/nvidia-oot/drivers/net/ethernet/nvidia/nvethernet/osd.c
@@ -895,6 +895,7 @@ void ether_restart_lane_bringup_task(struct tasklet_struct *t)
 		netif_tx_stop_all_queues(pdata->ndev);
 		netif_tx_unlock(pdata->ndev);
+		pdata->set_speed_retry_cnt = 0;
 		schedule_delayed_work(&pdata->set_speed_work, msecs_to_jiffies(500));
 		if (netif_msg_drv(pdata)) {
 			netdev_info(pdata->ndev,

Hi whitesscott,

Excellent analysis. Your points address critical aspects that our RCU stall patch didn’t fully cover, particularly the PCS lock timeout being too short for 25G links and the workqueue flooding from fixed retry intervals.
The exponential backoff approach is exactly right for this scenario. I’ll test the patch.

Hi,

We have observed that the issue does not reproduce on the next JetPack release with the method you provided.

Please wait for the next release and let us know if you still hit the issue with it.

Hi WayneWWW,

I wanted to check if the 100G speed will be improved in the next Jetpack release. If not, we probably won’t consider upgrading for now.