I haven’t bumped into the issue recently, but I’ve been tweaking a lot of parameters.
One thing to note with PREEMPT_RT is that each IRQ has its own kernel thread, and the default priority is 50. So if your application/thread has a higher priority, it’ll preempt the IRQ threads.
In the case of the devkit NIC, there are 3 IRQs:
watch -n 1 -d grep -e Err -e IPI -e eth -e CPU -e arch /proc/interrupts
Surprisingly enough, all IRQs are scheduled on CPU 0, although the default SMP affinity allows scheduling on all cores.
I use a Yocto-based distro and I haven’t taken the time to flash the stock JetPack to check the behavior there. Maybe someone can confirm whether it is the same or not =). But it’s the same kernel.
You can manually change the affinity:
echo 1 > /proc/irq/40/smp_affinity_list
echo 2 > /proc/irq/42/smp_affinity_list
echo 3 > /proc/irq/43/smp_affinity_list
This gives the following affinity:
- ether_qos.common_irq - core 1
- 2490000.ether_qos.rx0 - core 2
- 2490000.ether_qos.tx0 - core 3
Note that the setting is volatile and needs to be re-applied after each boot, so you’d need to add a startup script.
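A minimal startup-script sketch (the IRQ numbers 40/42/43 are the ones from my example above and may differ on your board, so check /proc/interrupts first):

```shell
#!/bin/sh
# Re-apply NIC IRQ affinities at boot (call from /etc/init.d or a
# systemd unit). IRQ numbers are assumptions taken from the example above.
set_irq_affinity() {
    irq=$1; cpu=$2
    # Skip IRQs that don't exist on this system
    if [ -w "/proc/irq/$irq/smp_affinity_list" ]; then
        # Ignore write failures (e.g. insufficient permissions)
        echo "$cpu" 2>/dev/null > "/proc/irq/$irq/smp_affinity_list" || :
    fi
}

set_irq_affinity 40 1   # ether_qos.common_irq  -> core 1
set_irq_affinity 42 2   # 2490000.ether_qos.rx0 -> core 2
set_irq_affinity 43 3   # 2490000.ether_qos.tx0 -> core 3
```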
The other thing that affects this is the priority. Let’s say you cranked the niceness for your process to -20 and set the priority to 99 on one or multiple threads. If the scheduler happens to schedule that work on CPU 0, where all the IRQs appear to be handled, then the IRQ threads won’t get CPU time, and this could explain the queue filling up.
In your code it’s easy to change the priority, CPU affinity, and niceness for your threads.
For other processes / IRQs you can use the chrt utility.
You need to find the PID for the IRQ thread. In htop’s display options, make sure not to hide kernel threads and to show custom thread names.
In /proc/interrupts, we saw that 2490000.ether_qos.tx0 is on IRQ 42, so if you search for irq/42 in htop, you get the corresponding PID (4532 for example).
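If you’d rather not dig through htop, the same PID can be found from the command line (irq/42 is taken from the example above; the fallback message covers systems where that thread doesn’t exist):

```shell
# List kernel threads whose name contains "irq/42", with their
# RT priority and scheduling policy
ps -eLo pid,rtprio,policy,comm | grep 'irq/42' \
    || echo "no irq/42 thread found on this system"
```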
To get the policy:
# chrt -p 4532
pid 4532's current scheduling policy: SCHED_FIFO
pid 4532's current scheduling priority: 50
To set the RT priority to 99:
chrt -f -p 99 4532
The policy option affects the jitter quite a bit. SCHED_OTHER is the default for non-RT threads (the priority is ignored). Priorities only apply when using SCHED_FIFO or SCHED_RR.
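As a quick sanity check, chrt can print the valid priority range per policy; SCHED_OTHER reports 0/0, i.e. the priority is ignored there (guarded in case chrt isn’t installed):

```shell
# Print min/max static priority for each scheduling policy
chrt -m 2>/dev/null || echo "chrt (util-linux) not available"
```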
Note that with round robin, a thread gives up the CPU at the end of its timeslice, 25 ms by default on this kernel:
# cat /proc/sys/kernel/sched_rr_timeslice_ms
You can change that too.
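For example, to shorten the timeslice to 10 ms (the value is in milliseconds; this needs root and, like the IRQ affinity, doesn’t survive a reboot):

```shell
# Set the SCHED_RR timeslice to 10 ms; falls back to a message when
# we lack permission (e.g. not running as root)
echo 10 2>/dev/null > /proc/sys/kernel/sched_rr_timeslice_ms \
    || echo "need root to change the RR timeslice"
```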
So it’s quite a balancing act deciding what to schedule where, and it all depends on your application. I’ll be running more tests to see if I run into this issue again. I would still not expect the network driver to “crash” because of a full send queue.
Hopefully it was just a side effect of too-aggressive scheduling.