Network connection loss when TX ring full

Hi,

I recently applied the PREEMPT-RT patches to my kernel and have been doing some optimization.

With a streaming-intensive application (~950 Mbit/s) on eth0, I get this error:

eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0

At this point I lose the connection to the device. From a terminal I can see that the interface is still up:

eth0      Link encap:Ethernet  HWaddr 00:04:4B:E5:92:98  
          inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::204:4bff:fee5:9298/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:1322586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6788940 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:68915744 (65.7 MiB)  TX bytes:78902510607 (73.4 GiB)
          Interrupt:40 

but I cannot ping my host PC at 192.168.1.2 and vice versa.

The only way to recover the connection is to restart the network manager, after which I see these kernel messages:

[30001.363523] gpio tegra-gpio wake20 for gpio=52(G:4)
[30001.374531] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[30004.303672] eqos 2490000.ether_qos eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[30004.304901] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Is there a way to gracefully recover from this error?

I cannot answer what you want to know, but it is somewhat interesting to see the TX packet count go up so high with no errors, drops, overruns, or carrier errors, and yet it fails. Prior to the failure, are you able to ping the PC from the Jetson? Can you verify that the eth0 ifconfig above is from the Jetson side?

Can you also show the ifconfig for the other computer, along with route output for both Jetson and other computer?

@damien.lefevre

Getting the same issue, any insights?

In my case, I can’t ping anything on the network from within the Xavier, and nmap on an external device can’t see the Xavier on the network.
As for ifconfig, the interface doesn’t even know that it has dropped off:
[image: ifconfig output]

Your ifconfig shows address 192.168.1.7 was configured. What is your output from the “route” command? The two work together. Same question on the host PC side…what is the output from “ifconfig” and “route”?

Note: I had to blank some things out for privacy/security reasons.
Route gave the following at the time:


Normally it gives:

On the host side it is:

In the ifconfig output, the one of note is the wifi interface:

What exactly are you looking for here? Can you please elaborate?

I was looking to see if the subnets matched (they do), and if the network interface on either side indicated a network setup issue (such as errors, drops, overruns, framing, collision…none seen).

At the external network level it appears that all is good. The error is probably internal to the Jetson (you already knew that, but worth mentioning), but it isn’t the network PHY itself at issue. My thought is that the application consuming data from the network simply is not fast enough (not necessarily the application’s fault; could for example be IRQ starvation or a lack of priority).

Note that if this is on the Jetson side, then your application is always sending data when the error hits:

The TX did not note any drops, and so this is quite curious. The thought here is that either the ifconfig was run prior to an error (confirm if the ifconfig was prior to or after the error), or else the data was dropped before making its way into the TX queue.

After the error occurs, if you use “less /proc/interrupts”, at the end of the file, is the “Err:” line still 0 (it should be)?
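For a quicker check than paging through the file, something like the sketch below pulls out just that counter. The helper name irq_err_count is made up for illustration, and it assumes the line has the usual "Err: <count>" layout.

```shell
#!/bin/sh
# irq_err_count [FILE]
# Print the value from the "Err:" line of an interrupts table.
# FILE defaults to /proc/interrupts; the optional argument exists only so
# the helper can be tried against a saved copy of the file.
irq_err_count() {
    awk '$1 == "Err:" { print $2 }' "${1:-/proc/interrupts}"
}

# Example (run after the "TX ring full" message has appeared):
#   irq_err_count
```

Anything other than 0 here would be worth posting along with the full /proc/interrupts output.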

Hey @linuxdev,

Yeah, it’s interesting. When you say application, do you mean a single application or the “application layer”? An nmap scan from an external device does not show the Xavier on the network, and from within the Xavier it cannot even ping the gateway (btw, unplugging and replugging the ethernet does not help), so I think this may be an OS-layer issue.

This issue is happening once a week. I have a few tests to run next time it happens (unless the new salt lamp I put next to it fixes it) and will repost here. Is there anything else you think is worth testing (besides the “less” command)?

Hi @thesyght

I haven’t bumped into the issue recently, but I’ve been tweaking a lot of parameters.

One thing to note with PREEMPT-RT is that each IRQ has its own kernel thread, and the default priority is 50. So if your application/thread has a higher priority, it’ll preempt the IRQ thread.

In the case of the devkit NIC, there are 3 IRQs, which you can watch with:

watch -n 1 -d grep -e Err -e IPI -e eth -e CPU -e arch /proc/interrupts

Surprisingly enough, all IRQs are scheduled on CPU 0, although the default SMP affinity allows scheduling on all cores:

~# cat /proc/irq/default_smp_affinity
ff

I use a yocto based distro and I haven’t taken the time to flash the stock jetpack to check the behavior there. Maybe someone can confirm it is the same or not =). But it’s the same kernel.

You can manually change the affinity

echo 1 > /proc/irq/40/smp_affinity_list
echo 2 > /proc/irq/42/smp_affinity_list
echo 3 > /proc/irq/43/smp_affinity_list

To give affinity

  • ether_qos.common_irq - core 1
  • 2490000.ether_qos.rx0 - core 2
  • 2490000.ether_qos.tx0 - core 3

Note that the setting is volatile and needs to be re-applied after each boot, so you’d need to add a startup script.
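A minimal sketch of such a script is below. It looks the IRQ numbers up by name instead of hard-coding 40/42/43, in case the numbers differ on another kernel or board. The helper names irq_for_name and pin_irq are invented here, and the lookup assumes the IRQ name is the last field on its /proc/interrupts line, so check that against your own output first.

```shell
#!/bin/sh
# irq_for_name NAME [FILE]
# Print the IRQ number whose line in the interrupts table ends with NAME.
# FILE defaults to /proc/interrupts.
irq_for_name() {
    awk -v name="$1" '$NF == name { sub(/:$/, "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# pin_irq NAME CPU
# Look up the IRQ by name and pin it to one CPU (needs root).
pin_irq() {
    irq=$(irq_for_name "$1")
    [ -n "$irq" ] || { echo "IRQ for $1 not found" >&2; return 1; }
    echo "$2" > "/proc/irq/$irq/smp_affinity_list"
}

# Same mapping as above; call these from a boot-time script:
#   pin_irq ether_qos.common_irq   1
#   pin_irq 2490000.ether_qos.rx0  2
#   pin_irq 2490000.ether_qos.tx0  3
```

How you hook it into boot (systemd unit, init script, etc.) depends on the distro.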

The other thing that has an effect is the priority. Let’s say you cranked the niceness for your process to -20 and set priority to 99 on one or multiple threads. If the scheduler happens to schedule that work on CPU 0, where all the IRQs appear to be handled, then the IRQ threads won’t get CPU time, and this could explain the queue filling up.

In your code it’s easy to change the priority and CPU affinity for your threads, and the niceness.
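If you’d rather experiment without touching the code, roughly the same thing can be done at launch time with chrt and taskset. This is only a sketch: run_rt_pinned and the DRY_RUN switch are made-up names, and the CPU list and priority in the example are placeholders, chosen so the application stays below the IRQ threads’ default priority of 50 and off CPU 0.

```shell
#!/bin/sh
# run_rt_pinned CPUS PRIO COMMAND [ARGS...]
# Start COMMAND under SCHED_FIFO at priority PRIO, restricted to CPUS.
# Setting DRY_RUN to any non-empty value prints the command line instead
# of running it. Starting a SCHED_FIFO process normally needs root.
run_rt_pinned() {
    cpus=$1
    prio=$2
    shift 2
    ${DRY_RUN:+echo} chrt -f "$prio" taskset -c "$cpus" "$@"
}

# Example with placeholder values (cores 4-7, priority 40):
#   run_rt_pinned 4-7 40 ./my_streaming_app
```

Individual threads that set their own policy or affinity in code will override what they inherit here.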

For other processes / IRQs you can use the chrt utility.

You need to find the PID for the IRQ. In htop display options, make sure not to hide kernel threads and show custom thread names.

In /proc/interrupts, we saw that 2490000.ether_qos.tx0 is on IRQ 42, so if you search for irq/42 in htop, you get the corresponding PID (4532 for example).

To get the policy:

# chrt -p 4532
pid 4532's current scheduling policy: SCHED_FIFO
pid 4532's current scheduling priority: 50

To set the highest RT priority (99):

chrt -f -p 99 4532
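If you want to script this instead of looking the PID up in htop, a sketch like the one below may help. It relies on the IRQ kernel threads being named irq/<number>-<name>, and the ps invocation in the example assumes the procps version of ps (busybox may need different options). irq_thread_pid is a made-up helper name.

```shell
#!/bin/sh
# irq_thread_pid IRQ
# Read "PID COMM" lines on stdin and print the PID of every kernel thread
# whose name starts with "irq/<IRQ>-".
irq_thread_pid() {
    awk -v prefix="irq/$1-" 'index($2, prefix) == 1 { print $1 }'
}

# Example, using IRQ 42 from above (changing the priority needs root):
#   pid=$(ps -eo pid=,comm= | irq_thread_pid 42)
#   chrt -p "$pid"          # show the current policy and priority
#   chrt -f -p 99 "$pid"    # SCHED_FIFO, priority 99
```

If an IRQ has more than one thread, the helper prints several PIDs, so check the result before passing it to chrt.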

The policy option affects the jitter quite a bit. SCHED_OTHER is the default for non-RT threads (priority is ignored). Priorities are applied when using SCHED_FIFO and SCHED_RR.

Note that a SCHED_RR thread gives up the CPU to other threads of the same priority at the end of its time slice, which is 25 ms here:

# cat /proc/sys/kernel/sched_rr_timeslice_ms
25

You can change that too.

So it’s quite a balancing act deciding what to schedule where, and it all depends on your application. I’ll be running more tests to see if I hit this issue again. I would still not expect the network driver to “crash” because of a full send queue.

Hopefully it was just a side effect of too aggressive scheduling.

I mean any application at the RX end which consumes the data and results in the buffer clearing. This could be a driver, but since it has no errors or drops or overruns, all I can say is it is “something else other than the actual network activity”. I am kind of stabbing in the dark, but I can tell you that the network itself has functioned flawlessly, and something related to the network is failing.

The TX is on the Jetson side, the RX on the other end. Note that you will see a packet count via ifconfig as data is sent or received. Some of the traffic will be for DHCP setup, or perhaps some sort of route setup, but if you are not using ssh to log in, then you can be guaranteed that most of the RX and TX packet count is from your application running. To that end, without using ssh (is that possible?), you could run ifconfig on the particular interface at each end and see whether the count goes up approximately the same on both ends during normal operation. Then, at the moment of failure, see whether the RX end failed to read due to missing packets, or whether RX read the packets and TX just stopped. Example:
watch -n 1 ifconfig eth0
(then watch TX packets at the Jetson, and RX packets at the other end)
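If you want a log you can look at after the failure rather than watching the screen, the same counters are available under /sys/class/net/<interface>/statistics/. A rough sketch follows; the function names and the SYSFS_NET override are invented for this example.

```shell
#!/bin/sh
# net_counter IFACE NAME
# Print one interface counter, e.g. tx_packets or rx_packets.
# SYSFS_NET is an override made up for this sketch so the helper can be
# pointed at a test directory; normally leave it unset.
net_counter() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/$2"
}

# log_counter IFACE NAME [INTERVAL_SECONDS]
# Print a timestamp, the counter, and its change since the last sample,
# once per interval, until interrupted.
log_counter() {
    prev=$(net_counter "$1" "$2") || return 1
    while sleep "${3:-1}"; do
        cur=$(net_counter "$1" "$2") || return 1
        echo "$(date +%T) $2=$cur delta=$((cur - prev))"
        prev=$cur
    done
}

# On the Jetson:      log_counter eth0 tx_packets >> /tmp/tx.log
# On the other end:   log_counter <its interface> rx_packets >> /tmp/rx.log
```

Comparing the two logs around the time of the failure should show which side’s delta drops to zero first.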

This is where we can run into problems when the hardware IRQ aggregator sends everything to CPU0. Some hardware has access to all cores (e.g., memory controller and timers), other hardware can only interrupt CPU0. You can try to force affinity to a core without hardware IRQ access, but it would end up being rescheduled to CPU0 for such a case (I wish I knew more about the IRQ aggregator design).

I do not know what differences you will see when using Yocto, but it is kind of a “wild card”.

The scheduling experiments are a good idea, but be careful to leave CPU0 alone since it is the only core which handles some hardware.