We are running Thor with the four MGBE ports in 10GbE mode, streaming data to a link partner. In a previous thread (Jetson Thor - MGBE TX Pause Frames), we resolved TX Pause Frame support via a device tree edit — thanks again to @whitesscott and @WayneWWW. With pause frames working, we never over-drive the link partner. That problem is solved.
The next issue we’re chasing is large inter-packet gaps on the TX path. We are streaming at approximately 8 Gbps per port. Our packets are ~8 KB, so nominally they depart every ~8 µs — about 125K packets per second. Most of the time they do, often faster. However, we occasionally observe inter-packet gaps of up to 6 ms. Our link partner has enough buffering to tolerate about 1 ms of gap, so these 6 ms gaps cause data loss on the receiving end.
To be clear: the source data is available well ahead of time in a DRAM buffer. There is no producer-side latency race. The gaps appear to be introduced somewhere in the TX path between our userspace application and the wire — likely Linux scheduling, the kernel network stack, or the MGBE driver/DMA engine.
We’d appreciate any guidance on:

- What tools or instrumentation are available on Thor to diagnose where these gaps originate? (e.g., ethtool stats, MGBE driver debug, tracing, etc.)
- Are there known tuning knobs — sysctl settings, IRQ affinity, NAPI parameters, TX coalescing, ring buffer sizes — that can reduce worst-case TX latency on the MGBE?
- Is there a kernel-bypass or zero-copy TX path available for the MGBE on Thor? (AF_XDP, DPDK, Holoscan, or similar?)
- Any known issues with Linux scheduler interaction and the MGBE driver that could explain multi-millisecond stalls?
We are running jetson-r38.4.0. Any suggestions or pointers would be greatly appreciated.
Thank you in advance! Shep
Something here might be helpful.
# Disable flow control entirely on the Jetson to see if the gaps disappear. You may need to disable it on the switch/receiver too.
ethtool -A mgbe0_0 rx off tx off
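Before turning flow control off, it may be worth confirming whether the 6 ms gaps actually coincide with pause frames arriving from the link partner; note also that disabling it removes the back-pressure fix from your earlier thread, so treat it as a diagnostic step only. A quick check (the pause counter names vary by driver, hence the loose grep):

```shell
IFACE=mgbe0_0
if [ -d "/sys/class/net/$IFACE" ]; then
    # Negotiated flow-control state for this port
    ethtool -a "$IFACE"
    # Pause-related counters; exact names are driver-specific, so match loosely
    ethtool -S "$IFACE" | grep -i pause
fi
```

If the pause counters jump at the same moment the receiver reports a gap, the hardware/bus-stall pattern described further down becomes the prime suspect.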
# If the TX ring buffer is too small, or if interrupt coalescing holds off TX
# completion interrupts for too long, the MGBE driver will run out of TX descriptors.
# Query the current and maximum ring buffer sizes:
ethtool -g mgbe0_0
# Increase the TX ring buffer to its maximum supported size:
ethtool -G mgbe0_0 tx <max_value>
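Rather than hard-coding `<max_value>`, the pre-set maximum can be read straight out of `ethtool -g` and applied; a small sketch, assuming the standard ethtool output layout where the first `TX:` line is the pre-set hardware maximum:

```shell
IFACE=mgbe0_0   # port under test

# First "TX:" line from `ethtool -g` is the pre-set hardware maximum
tx_max() { awk '/^TX:/ {print $2; exit}'; }

if [ -d "/sys/class/net/$IFACE" ]; then
    MAX=$(ethtool -g "$IFACE" | tx_max)
    echo "Raising TX ring on $IFACE to $MAX descriptors"
    ethtool -G "$IFACE" tx "$MAX"
fi
```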
# Tune interrupt coalescing. Disable "Adaptive TX" and set a hard, low limit for TX usecs and frames.
ethtool -C mgbe0_0 adaptive-tx off tx-usecs 50 tx-frames 32
# Lock the Jetson clocks to maximum performance to eliminate frequency scaling and deep idle states:
sudo jetson_clocks
# Try disabling offloads to force the stack to push discrete packets.
ethtool -K mgbe0_0 tso off gso off sg off
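To correlate driver behavior with the receiver-side gaps, it can also help to poll the NIC statistics once a second and print only the counters that changed. A sketch (the interface name and one-minute sampling window are placeholders to adjust):

```shell
IFACE=mgbe0_0
prev=$(mktemp); cur=$(mktemp)
if [ -d "/sys/class/net/$IFACE" ]; then
    ethtool -S "$IFACE" > "$prev"
    for i in $(seq 1 60); do             # one-second samples for a minute
        sleep 1
        ethtool -S "$IFACE" > "$cur"
        echo "--- $(date +%T) ---"
        diff "$prev" "$cur" | grep '^>'  # counters that moved this second
        mv "$cur" "$prev"
    done
fi
```

Watch for pause, drop, or queue-stop counters incrementing in the same second the link partner reports a gap.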
Ftrace
Prepare the ftrace environment
sudo su
cd /sys/kernel/debug/tracing
# Reset and set tracer
echo 0 > tracing_on
echo > trace
echo function_graph > current_tracer
# Clear old filters
echo > set_ftrace_filter
# Add core network TX boundaries and queue controls
echo 'dev_hard_start_xmit' >> set_ftrace_filter
echo 'netif_tx_stop_queue' >> set_ftrace_filter
echo 'netif_tx_wake_queue' >> set_ftrace_filter
# Add everything in the nvethernet module
echo ':mod:nvethernet' >> set_ftrace_filter
# Set the 5 ms trap (5000 microseconds)
echo 5000 > tracing_thresh
# Start tracing
echo 1 > tracing_on
# Let your 8 Gbps stream run while tracing is on. Wait until your receiving link partner reports the ~6 ms gap. Immediately stop the trace to prevent the ring buffer from wrapping:
echo 0 > tracing_on
cat trace > /home/$SUDO_USER/nvethernet_stall_trace.txt
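Once the trace file is saved, the offending call chains can be pulled out quickly: with `tracing_thresh` set, function_graph only logs calls slower than the threshold, and it flags long durations with markers such as `!` or `#`:

```shell
TRACE=/home/$SUDO_USER/nvethernet_stall_trace.txt
if [ -f "$TRACE" ]; then
    # Lines carrying a duration marker are the ones that tripped the 5 ms trap
    grep -nE '[!#+] +[0-9]+\.[0-9]+ us' "$TRACE" | head -40
fi
```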
Open the trace file; one of these patterns should appear.
A Hardware/Bus Stall
This pattern shows up if the delay occurs at the very bottom of the stack, while the driver is physically handing the packet descriptors to the MGBE hardware.
Your trace may look something like this:
# tracer: function_graph
# CPU DURATION FUNCTION CALLS
# | | | | | | |
4) ! 6005.12 us | dev_hard_start_xmit() {
4) ! 6002.88 us | nve_start_xmit();
4) ! 6005.50 us | }
Impression: The kernel is perfectly healthy, but the nve_start_xmit function (which writes to the PCIe/memory-mapped registers of the MGBE) was physically blocked from completing for 6 milliseconds.
This may be the hardware honoring an incoming 802.3x Pause Frame from your link partner,
or a severe DMA/SMMU translation stall on Thor's memory controller.
Ring Buffer Starvation / Queue Stall
If the TX ring buffer fills up because the hardware isn't draining it fast enough (or the kernel isn't cleaning up transmitted packets fast enough),
the driver will forcefully stop the kernel's network queue.
# tracer: function_graph
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) 1.22 us | netif_tx_stop_queue();
... (6 milliseconds of absolutely nothing on this CPU) ...
2) 0.95 us | netif_tx_wake_queue();
Note: You might have to look at the absolute timestamps on the far left of the full trace to see the 6 ms time jump between the stop and the wake.
The driver ran out of TX descriptors. The kernel stopped sending. It took 6 ms for the hardware to finally trigger a TX Completion Interrupt (or for the CPU to wake up and process the SoftIRQ) to clean out the old packets and wake the queue.
Interrupt coalescing may be set too high, the TX ring buffer may be too small, or the CPU core handling the IRQ may have gone into a deep C-state and taken too long to wake up.
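For this pattern, two quick checks are which core services the MGBE interrupts and whether that core is allowed into deep idle. The IRQ number below is hypothetical, so read the real one off /proc/interrupts first:

```shell
# Which CPUs are fielding the MGBE interrupts?
grep -i mgbe /proc/interrupts || true

# Affinity masks are hex bitmaps: CPU 3 -> 1<<3 = 0x8
printf 'mask for CPU 3: %x\n' $((1 << 3))

# Pin the TX IRQ (e.g. IRQ 120, hypothetical) to CPU 3:
# echo 8 | sudo tee /proc/irq/120/smp_affinity

# Deep-idle states on that core (1 = state disabled, 0 = allowed)
cat /sys/devices/system/cpu/cpu3/cpuidle/state*/disable 2>/dev/null || true
```

Pinning the IRQ to a core the streaming app does not run on, and keeping that core out of deep idle, attacks both halves of this failure mode.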
CPU Preemption / OS Jitter
If the delay happens higher up in the core network stack before it even reaches the nvethernet driver, you will see something like this:
# tracer: function_graph
# CPU DURATION FUNCTION CALLS
# | | | | | | |
5) ! 6150.00 us | dev_hard_start_xmit() {
5) 2.15 us | nve_start_xmit();
5) ! 6155.10 us | }
In this example, nve_start_xmit() itself took only 2 microseconds, but its parent dev_hard_start_xmit() took over 6 ms.
This means the CPU was preempted right in the middle of executing the network stack:
the kernel scheduler pulled the core away from your network thread to run something else and did not give it back for 6 ms.
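If this is the pattern in the trace, a common mitigation is to pin the sender to an isolated core and run it under a real-time scheduling policy so CFS cannot preempt it for milliseconds at a stretch. A sketch, where ./stream_app is a placeholder for your sender and the core must be reserved on the kernel command line (e.g. isolcpus=5 nohz_full=5):

```shell
APP=./stream_app                 # placeholder for your sender binary
if [ -x "$APP" ]; then
    # SCHED_FIFO priority 50 (valid range 1-99), pinned to isolated CPU 5
    sudo chrt -f 50 taskset -c 5 "$APP"
fi
```

Real-time priorities need to be used with care: a runaway SCHED_FIFO thread can starve the rest of the core, so keep housekeeping off the isolated CPU.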