Kernel panic with jumbo frames in L4T 32.5.1 / TX2 4GB

aki.reijonen · March 17, 2021, 5:10pm

I have been using a higher MTU for performance reasons successfully with earlier L4T releases, for example 32.4.4. However, in L4T 32.5.1 the kernel panics soon after the MTU is increased. Is this a known problem, and is there a fix or workaround available?

Most of the time the crash is related to SKB handling (perhaps a buffer size mismatch once the MTU is increased?). Example session:

$ sudo ifconfig eth0 mtu 5000
$ dmesg|tail
[ 31.814700] vdd-3v3: disabling
[ 31.814711] en-vdd-vcm-2v8: disabling
[ 31.814722] vdd-sys-bl: disabling
[ 31.814731] en-vdd-sys: disabling
[ 48.754054] eqos 2490000.ether_qos: changing MTU from 1500 to 5000
[ 48.754389] bcm54xx_low_power_mode(): put phy in iddq-lp mode
[ 48.836876] gpio tegra-gpio wake18 for gpio=101(M:5)
[ 52.308129] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 52.375521] eqos 2490000.ether_qos eth0: Link is Up - 1Gbps/Full - flow control off
[ 52.376847] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

In this case it crashes only four seconds after the link is back up:

[ 56.782076] kernel BUG at /dvs/git/dirty/git-master_linux/kernel/kernel-4.9/mm/slub.c:3873!
[ 56.790680] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 56.796320] Modules linked in: zram 8021q garp mrp spidev ov5693 overlay userspace_alert nvgpu bluedroid_pm ip_tables x_tables
[ 56.808394] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.9.201-tegra #1
[ 56.815269] Hardware name: lightning (DT)
[ 56.819397] task: ffffffc0f0f11c00 task.stack: ffffffc0f0f34000
[ 56.825487] PC is at kfree+0x254/0x2a8
[ 56.829351] LR is at skb_free_head+0x28/0x48
[ 56.833741] pc : [] lr : [] pstate: 40400045
[ 56.841324] sp : ffffffc0f0f377b0
[ 56.844734] x29: ffffffc0f0f377b0 x28: ffffffc0ef3ea900
[ 56.850246] x27: ffffffc0e59a9700 x26: 0000000000000000
[ 56.855756] x25: ffffff800a077100 x24: ffffffc0c1fcc064
[ 56.861262] x23: ffffffc0d0f89a28 x22: ffffffc0c1fcc000
[ 56.866767] x21: ffffffc0d0f89a00 x20: ffffff8008d96710
[ 56.872271] x19: ffffffbf0307f300 x18: 000000000000088e
[ 56.877776] x17: 0000000000000000 x16: 0000000000000000
[ 56.883281] x15: 000000000003d20a x14: 0000000000000001
[ 56.888784] x13: 0000000000000000 x12: 0000000000000008
[ 56.894286] x11: ffffff800917c030 x10: 0000000000000008
[ 56.899793] x9 : 0000000000000000 x8 : 00000000000000c8
[ 56.905299] x7 : 0000000000000001 x6 : ffffffc0d7d6ec80
[ 56.910806] x5 : ffffffc0f0f11c00 x4 : ffffffc0f5f80140
[ 56.916313] x3 : ffffffc0c1fcc000 x2 : 0000000000001ec0
[ 56.921816] x1 : 0000000000000000 x0 : 0000000000000000
[ 56.927319]
[ 56.928868] Process ksoftirqd/0 (pid: 3, stack limit = 0xffffffc0f0f34000)
[ 56.935919] Call trace:
[ 56.938453] [] kfree+0x254/0x2a8
[ 56.943379] [] skb_free_head+0x28/0x48
[ 56.948837] [] skb_release_data+0x100/0x130
[ 56.954736] [] skb_release_all+0x30/0x40
[ 56.960369] [] kfree_skb+0x40/0x120
[ 56.965560] [] __udp4_lib_rcv+0x6a8/0xaa8
[ 56.971282] [] udp_rcv+0x30/0x40
[ 56.976207] [] ip_local_deliver_finish+0x80/0x280
[ 56.982639] [] ip_local_deliver+0x54/0xf0
[ 56.988360] [] ip_rcv_finish+0x1f8/0x380
[ 56.993992] [] ip_rcv+0x284/0x390
[ 56.999007] [] __netif_receive_skb_core+0x3b8/0xad8
[ 57.005620] [] __netif_receive_skb+0x28/0x78
[ 57.011612] [] netif_receive_skb_internal+0x2c/0xb0
[ 57.018226] [] napi_gro_receive+0x15c/0x188
[ 57.024131] [] eqos_napi_poll_rx+0x368/0x4f8
[ 57.030120] [] net_rx_action+0xf4/0x358
[ 57.035666] [] __do_softirq+0x13c/0x3b0
[ 57.041215] [] run_ksoftirqd+0x48/0x58
[ 57.046676] [] smpboot_thread_fn+0x160/0x248
[ 57.052663] [] kthread+0xec/0xf0
[ 57.065283] [] ret_from_fork+0x10/0x30
[ 57.078292] ---[ end trace c485d8a89eef9e66 ]---
[ 57.120358] Kernel panic - not syncing: Fatal exception in interrupt
[ 57.134340] SMP: stopping secondary CPUs
[ 57.145782] Kernel Offset: disabled
[ 57.156661] Memory Limit: none
[ 57.167064] trusty-log panic notifier - trusty version Built: 19:52:41 Mar 2 2021
[ 57.204251] Rebooting in 5 seconds…

WayneWWW · March 18, 2021, 5:40am

Sounds like an issue here. What is your method to set up jumbo frames?

aki.reijonen · March 18, 2021, 9:50am

Not sure what you mean. Here I just manually set the higher MTU with ifconfig (ifconfig eth0 mtu 5000 for example). Seems that my formatting was a bit funky in the original message so maybe it was not clear.

Actual use case is that an FPGA chip sends data to Tegra as jumbo frames (UDP) which causes the crash. I now left the data transmission disabled and instead sent jumbo frames from a desktop PC using mausezahn and it did not crash after several minutes, so I’ll try to investigate whether the packet contents make a difference here.

Tagged virtual LANs are also being used, so there might be some issue related to that: the connection to the FPGA uses a different VLAN than the connection to the desktop PC.

WayneWWW · March 18, 2021, 9:52am

What if I want to reproduce your issue with the devkit? Is it possible for me to reproduce this issue with my device? There is no FPGA chip here, so I may need an alternative.

Please also note that we need this issue to be reproduced on the devkit so that we can help check.

aki.reijonen · March 18, 2021, 10:10am

In these tests, Tegra is mounted to the devkit instead of our own carrier board so there is no issue in that part. I will try to find a simple way to reproduce the issue without any special hardware connected to it.

WayneWWW · March 18, 2021, 10:42am

Thanks. Waiting for your reply.

aki.reijonen · March 18, 2021, 11:21am

I can now reproduce it with mausezahn from my Linux desktop PC. Whether it crashes depends on the combination of MTU and packet size. So far I have found:

Set MTU 5000 on Tegra => packet size 2000 causes a crash (1000 and 3000 are OK)
Set MTU 9000 on Tegra => packet sizes 1000, 2000 and 3000 all cause a crash

This is still using tagged VLANs, but my guess is that they have no effect on the issue. To make the Ethernet frames the same size without VLAN tagging, 4 bytes could be added to the packet size. However, the exact packet size does not seem to matter much: for example, I tried sending 2000-4=1996 bytes per packet and it also crashed immediately.
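The 4-byte VLAN adjustment can be checked with a little arithmetic. The Python sketch below assumes that "packet size" means the UDP payload length and that IPv4 without options is used; both are assumptions, not something stated in the post. The function name `frame_bytes` is made up for this illustration.

```python
# Header sizes in bytes. The frame length below excludes preamble and FCS.
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
VLAN_TAG = 4      # one 802.1Q tag
IPV4_HEADER = 20  # IPv4 header without options (assumption)
UDP_HEADER = 8


def frame_bytes(udp_payload: int, vlan_tagged: bool) -> int:
    """Return the Ethernet frame length for one unfragmented UDP datagram."""
    size = ETH_HEADER + IPV4_HEADER + UDP_HEADER + udp_payload
    if vlan_tagged:
        size += VLAN_TAG
    return size


# A tagged frame with a 1996-byte payload is as long as an untagged
# frame with a 2000-byte payload.
print(frame_bytes(1996, vlan_tagged=True))   # 2042
print(frame_bytes(2000, vlan_tagged=False))  # 2042
```

If "packet size" instead refers to the whole frame or to the IP packet, the constants shift but the 4-byte difference between tagged and untagged frames stays the same.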

For mausezahn in my linux desktop PC, I used:

mz <interface> -B <tegra-ip> -t udp "dp=12345" -p <packet-size> -c 1000000

There does not need to be anything on the tegra side listening in the UDP port - I just chose port 12345 randomly.
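For anyone without mausezahn at hand, a rough equivalent can be sketched with Python's standard `socket` module. This is only an approximation of the mz invocation above, not a byte-for-byte replacement: `payload_size` here is the UDP payload length, which may not correspond exactly to what mz's `-p` option counts. The helper names `make_payload` and `send_udp` are hypothetical.

```python
import socket


def make_payload(size: int) -> bytes:
    """Return `size` zero bytes to use as UDP payload filler."""
    if size < 0:
        raise ValueError("payload size must not be negative")
    return bytes(size)


def send_udp(dest_ip: str, dest_port: int, payload_size: int, count: int) -> int:
    """Send `count` UDP datagrams of `payload_size` bytes; return how many were sent."""
    payload = make_payload(payload_size)
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (dest_ip, dest_port))
            sent += 1
    return sent


# Example call (replace the placeholder address with the Tegra's IP):
# send_udp("<tegra-ip>", 12345, 2000, 1_000_000)
```

The sending interface also needs an MTU large enough for the datagram. Otherwise the sender's IP stack fragments it and no jumbo frame reaches the Tegra.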

Is the mz command enough for you to reproduce this issue?

I will also later on make sure that the issue persists without any VLANs being used (need to redo the physical setup).

aki.reijonen · March 18, 2021, 3:35pm

I have now checked that the tagged VLANs don’t make much of a difference: tested with a direct network cable from the desktop PC to the Tegra and no VLANs configured on either end. It still fails.

It does seem slightly harder to trigger, though, as it often does not crash immediately.

WayneWWW · March 22, 2021, 2:52am

"There does not need to be anything on the tegra side listening in the UDP port - I just chose port 12345 randomly."

Just want to make sure: the Tegra side does not need to run any client/server application? It can simply sit idle with the Ethernet cable connected to the host and the MTU set to a jumbo value, for example 5000?

aki.reijonen · March 22, 2021, 8:53am

Yes, that is correct.

WayneWWW · March 22, 2021, 8:58am

We can reproduce this issue easily, even with the iperf tool.

Will try to find out the cause. Thanks for the report.

mark.goodall · August 31, 2021, 9:18am

I’m experiencing the exact same issue on a Xavier. Is there a fix for this yet, other than downgrading?

WayneWWW · August 31, 2021, 9:46am

Sorry, not yet. Still investigating.

mark.goodall · August 31, 2021, 10:48am

Thanks. Is there an ETA, or an issue I can track somewhere (like https://bugzilla.kernel.org/)?

WayneWWW · August 31, 2021, 10:57am

I will update the solution here once we have one.

deepak.talwar1 · September 4, 2021, 1:06am

We are facing a similar issue with jumbo frames being sent from a camera on L4T 32.5.1 on a Jetson Xavier NX. Please advise us on what the fix may be.

WayneWWW · October 18, 2021, 7:24am

Hi,

We got this resolved. Please apply this patch or wait for the next release (after rel32.6.1).

diff --git a/drivers/net/ethernet/nvidia/eqos/drv.c b/drivers/net/ethernet/nvidia/eqos/drv.c
index 7c7bb26..6e7c98f 100644
--- a/drivers/net/ethernet/nvidia/eqos/drv.c
+++ b/drivers/net/ethernet/nvidia/eqos/drv.c
@@ -2329,8 +2329,15 @@
 #ifdef EQOS_ENABLE_RX_DESC_DUMP
 		dump_rx_desc(qinx, prx_desc, entry);
 #endif
+
+		/* Process rx packets which takes only 1 rx desc buffer
+		 * and drop other packets which are spread across
+		 * descriptors due to MTU mismatch. Do not free the
+		 * buffers but reuse the mapped skb buffer again.
+		 */
 		if (likely(!(status & EQOS_RDESC3_ES_BITS) &&
-			   (status & EQOS_RDESC3_LD))) {
+			   (status & EQOS_RDESC3_LD) &&
+			   (status & EQOS_RDESC3_FD))) {
 			/* Unmap the SKB */
 			skb = prx_swcx_desc->skb;
 			prx_swcx_desc->skb = NULL;
@@ -2364,10 +2371,8 @@
 			}
 
 			eqos_receive_skb(pdata, dev, skb, qinx);
-		} else {
+		} else
 			eqos_update_rx_errors(dev, status);
-			dev_kfree_skb_any(prx_swcx_desc->skb);
-		}
 
 		received++;
 		if (eqos_rx_dirty(prx_ring) >=
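
The changed receive condition in the patch can be illustrated with a small Python model. This is only a sketch of the accept/drop decision, not driver code: the flag values below are invented for the model (the real EQOS_RDESC3_* bit positions live in the driver headers), and the second half of the patch, which stops freeing the skb in the else branch, is not modeled.

```python
# Hypothetical flag values used only in this model.
FD = 0x1  # first descriptor of a frame
LD = 0x2  # last descriptor of a frame
ES = 0x4  # error summary


def accepts_old(status: int) -> bool:
    """Condition before the patch: no error and LD set."""
    return not (status & ES) and bool(status & LD)


def accepts_new(status: int) -> bool:
    """Condition after the patch: no error, and both LD and FD set."""
    return not (status & ES) and bool(status & LD) and bool(status & FD)


def delivered(descriptors, accepts):
    """For each descriptor status, report whether its buffer is passed up the stack."""
    return [accepts(status) for status in descriptors]


single = [FD | LD]    # frame that fits in one rx buffer
spread = [FD, 0, LD]  # frame spread across three descriptors

print(delivered(single, accepts_old))  # [True]
print(delivered(single, accepts_new))  # [True]
print(delivered(spread, accepts_old))  # [False, False, True]
print(delivered(spread, accepts_new))  # [False, False, False]
```

In this model the old condition hands the tail of a multi-descriptor frame up the stack as if it were a complete packet, while the new condition drops every part of such a frame, which matches the intent stated in the patch comment.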