Localhost/TCP Ping Latency Jetson TK1

Hello,

I was told by nvidia support to direct my question here.

uname -a
Linux tegra-ubuntu 3.10.40-ged4f697 #1 SMP PREEMPT Mon Dec 1 14:34:46 PST 2014 armv7l armv7l armv7l GNU/Linux

We are using ROS Indigo on this board for our robotics project. We noticed that on this version of the kernel we are having some networking problems.

The previous release, running the same unmodified code, works fine; the last version we tried before this was 19.2. After upgrading to 21.2.1 we are having issues. Our program communicates over localhost/TCP, and we noticed delays when sending messages from one program to another (from publisher to subscriber, all on the Jetson).

Looking further, we also noticed that ping times on this board are quite large compared to other boards.

Here is output from the Jetson running 21.2.1. As you can see, the first ping is always very quick each time ping is restarted:

ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=1.53 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=1.20 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=1.15 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=1.13 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=1.16 ms
64 bytes from localhost (127.0.0.1): icmp_seq=7 ttl=64 time=1.46 ms
64 bytes from localhost (127.0.0.1): icmp_seq=8 ttl=64 time=3.82 ms
64 bytes from localhost (127.0.0.1): icmp_seq=9 ttl=64 time=1.23 ms
64 bytes from localhost (127.0.0.1): icmp_seq=10 ttl=64 time=2.96 ms
64 bytes from localhost (127.0.0.1): icmp_seq=11 ttl=64 time=1.14 ms
64 bytes from localhost (127.0.0.1): icmp_seq=12 ttl=64 time=3.76 ms
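To compare runs like these at a glance, the time= values can be pulled out of captured ping output with a short awk sketch (the two sample lines in the here-doc stand in for a real capture; substitute your own):

```shell
# Extract every "time=X ms" value and print count, min, average, max.
awk -F'time=' '/time=/ {
    split($2, a, " ")             # a[1] is the RTT in ms
    sum += a[1]; n++
    if (min == "" || a[1] < min) min = a[1]
    if (a[1] > max) max = a[1]
}
END { printf "n=%d min=%s avg=%.3f max=%s\n", n, min, sum / n, max }' <<'EOF'
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=1.53 ms
EOF
```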

Here is the output from another ARM board (not NVIDIA):

ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.077 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.082 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.070 ms

And here is another x86 machine:

ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.048 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.054 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.049 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.051 ms
ubuntu@tegra-ubuntu:~$ uname -a
Linux tegra-ubuntu 3.10.40-grinch-21.2.1 #8 SMP PREEMPT Sat Dec 13 11:27:12 UTC 2014 armv7l armv7l armv7l GNU/Linux
ubuntu@tegra-ubuntu:~$ ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=7 ttl=64 time=0.074 ms
64 bytes from localhost (127.0.0.1): icmp_seq=8 ttl=64 time=0.073 ms
64 bytes from localhost (127.0.0.1): icmp_seq=9 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=10 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=11 ttl=64 time=0.067 ms
64 bytes from localhost (127.0.0.1): icmp_seq=12 ttl=64 time=0.066 ms

No issues here.

It’s a known issue under Linux that the R21.x kernel has trouble at full gigabit speeds (it seems to be an issue of the NIC driver interacting with the scheduler, which mysteriously goes away a few kernel versions later). This is not an issue under R19.x. I have, however, seen similar ping times under R19.x.

Those ping times are not necessarily “bad”, but they are unexpected and sometimes an issue where performance is required. Finding out more would require profiling, which is not “quick”: you would need to rebuild the kernel for this, and the issue might still be subtle to pinpoint (I do think this is an issue which needs to be addressed).

When I looked at this earlier (I’m not in a position to profile it) I made some observations as to possibilities. The first observation is that there is a dependence upon a hardware interrupt to begin the driver’s handling of the ping; if the start of the hardware interrupt handler is delayed, then so is the ping.

Looking at /proc/interrupts, you will see only one Jetson CPU core handling hardware interrupts. Quite some time ago, when SMP was first being supported on Intel x86 CPUs, the default was to handle hardware IRQs only on CPU0. In order to spread IRQs (I’ll stop saying “hardware”, but realize software interrupts are not included in this conversation) across all CPUs, those Intel CPUs required an IO-APIC to go beyond CPU0 (AMD had a different x86 scheme and did not require the IO-APIC). Deactivating Intel’s IO-APIC (or not having one) meant hardware was serviced less often and less responsively, via just one core.

Looking at the Jetson, I’m not certain whether the use of a single IRQ-handling core is because of the need for the equivalent of an IO-APIC, or if (like early AMD multi-core CPUs) it is only a software-fixable issue not requiring hardware modification. But to make a long story short, latencies for drivers whose hardware IRQs are serviced by a single core will climb much faster under increasing interrupt load than when many CPU cores can service IRQs. It’s called interrupt starvation when a hardware handler must wait because other handlers hold control.

So if profiling is done, I would look closely at the time it takes the IRQ handler to reach the driver code, versus whether the delays are within the driver itself. On R21.x you also have the complication of what I think is a flaw in the scheduler interactions (mysteriously fixed a few kernel versions later). Profiling really needs to be done both with a kernel version beyond the R21.x kernels, where scheduling and NIC drivers have no issue, and with the R19.x kernel.

For the other non-Jetson ARM boards you used, how many are multi-core? Do the multi-core boards show only CPU0 handling interrupts in /proc/interrupts?

I noticed the same behaviour as in the first post. I don’t think it’s related to any physical NIC driver, since it’s just localhost.

If I force all cores online, the ping drops from 1+ms to 0.2ms. If I then force all the cores to max clocks, the ping drops to 0.018ms.

More about the performance settings are in the wiki:
http://elinux.org/Jetson/Performance#Maximizing_CPU_performance

In a robotics project it might be possible/meaningful to set the governor (or frequencies) manually for the idle and normal cases: keep power consumption down when idle, and get maximum responsiveness in normal operation. Running with max clocks doesn’t increase power consumption drastically (depending on the goals, though) if the system is otherwise idle; I measured a 0.4W increase.
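As a concrete sketch of the steps on that wiki page (run as root; the sysfs paths are the ones the page lists, and every write is guarded so the script does nothing harmful on a machine without the Tegra nodes):

```shell
#!/bin/sh
# Sketch: disable tegra_cpuquiet, bring all cores online, lock clocks.

# Keep tegra_cpuquiet from taking cores offline again (Tegra-specific node).
f=/sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
[ -w "$f" ] && echo 0 2>/dev/null > "$f"

# Bring every secondary core online; count the nodes actually present.
n=0
for f in /sys/devices/system/cpu/cpu[1-9]*/online; do
    [ -w "$f" ] && echo 1 2>/dev/null > "$f"
    [ -e "$f" ] && n=$((n + 1))
done

# Lock the clocks by pinning each core's governor to "performance".
for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -w "$f" ] && echo performance 2>/dev/null > "$f"
done

echo "online nodes seen: $n"
```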

@linuxdev Does irqbalance need to be installed? I’m pretty sure it’s for Intel chips only… but just wondering if I may need it, because it wasn’t installed.

When I looked in /proc/interrupts I saw that there was only one core running, with all the interrupts on that core. After following @kulve’s link for maximizing CPU performance, I enabled all the cores and looked again: all four cores were now present, but CPU0 was still taking care of all the interrupts and none were scheduled on CPU1–3. Turning on all 4 cores manually did, however, improve the ping time to ~0.2ms (without any big processes running), and once I ran a ROS node the localhost ping went down to ~0.02ms. If I exited the ROS node, the ping would stay the same at ~0.02ms.

Enabling the cores manually only lasts until reboot, so would you or anyone know how I could permanently set all CPUs online and disable tegra_cpuquiet?

I also looked at the other ARM board I have (Radxa Rock Pro), and in its /proc/interrupts file the interrupts were distributed amongst all the cores, like so:

rock@radxa:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 32:          0          0          0          0       GIC  rk29-pl330.1
 33:          0          0          0          0       GIC  rk29-pl330.1
 34:      13033          0          0          0       GIC  rk29-pl330.2
 35:          0          0          0          0       GIC  rk29-pl330.2
 38:          0          0          0          0       GIC  bvalid
 41:          0          0          0          0       GIC  vepu
 42:          0          0          0          0       GIC  vdpu
 43:          0          0          0          0       GIC  rk3066b-camera
 46:      19244          0          0          0       GIC  rk30-lcdc.1
 48:          0          0          0          0       GIC  dwc_otg, dwc_otg_hcd:usb1, dwc_otg_pcd
 49:    1293956          0          0          0       GIC  dwc_otg, host20_hcd:usb2
 51:       1591          0          0          0       GIC  eth0
 55:       6266          0          0          0       GIC  rk29_sdmmc.0
 58:       1624          0          0          0       GIC  rk30-adc
 70:          0          0          0          0       GIC  rk29xx_spim
 71:          0          0          0          0       GIC  rk29xx_spim
 72:          0          0          0          0       GIC  rk30_i2c.0
 73:       2181          0          0          0       GIC  rk30_i2c.1
 74:       1077          0          0          0       GIC  rk30_i2c.2
 75:          0          0          0          0       GIC  rk30_i2c.3
 76:       6540          0          0          0       GIC  rk_timer0
 77:          0       1378          0          0       GIC  rk_timer1
 84:          0          0          0          0       GIC  rk30_i2c.4
 91:          0          0       1554          0       GIC  rk_timer2
 92:          0          0          0       2450       GIC  rk_timer3
 95:          0          0          0          0       GIC  rga
112:          0          0          0          0       GIC  debug-signal
164:          0          0          0          0      GPIO  play
165:          0          0          0          0      GPIO  bt_default_wake_host_irq
196:          0          0          0          0      GPIO  rtc_hym8563
FIQ:              fiq_glue
IPI0:          0          0          0          0  Timer broadcast interrupts
IPI1:       1809       2065       3253       2188  Rescheduling interrupts
IPI2:          8         16         18         15  Function call interrupts
IPI3:          0          0          0          0  Single function call interrupts
IPI4:          0          0          0          0  CPU stop interrupts
IPI5:          0          0          0          0  CPU backtrace
LOC:          0          0          0          0  Local timer interrupts
Err:          0

and /proc/interrupts on the Jetson looks like this (only after enabling all 4 cores; before that, only CPU0 is listed):

ubuntu@tegra-ubuntu:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
 29:     545326     125359      81126      65955       GIC  arch_timer
 30:          0          0          0          0       GIC  arch_timer
 32:         52          0          0          0       GIC  timer0
 36:          0          0          0          0       GIC  nvavp
 51:          0          0          0          0       GIC  mmc1
 52:          0          0          0          0       GIC  tegra-otg, tegra-udc
 53:          0          0          0          0       GIC  ehci_hcd:usb1
 55:          0          0          0          0       GIC  tegra-sata
 63:      49545          0          0          0       GIC  mmc0
 70:       1767          0          0          0       GIC  tegra12-i2c.0
 73:          0          0          0          0       GIC  tmr_lp2wake_cpu0
 74:          0          0          0          0       GIC  tmr_lp2wake_cpu1
 77:      38874          0          0          0       GIC  tegra_actmon, tegra_actmon, tegra_mon
 80:        185          0          0          0       GIC  soctherm_thermal
 83:          0          0          0          0       GIC  soctherm_edp
 85:       6206          0          0          0       GIC  tegra12-i2c.4
 90:          0          0          0          0       GIC  tegra-se
 91:          0          0          0          0       GIC  spi-tegra114.0
 95:          0          0          0          0       GIC  tegra12-i2c.5
 97:      83115          0          0          0       GIC  host_syncpt
 99:          0          0          0          0       GIC  host_status
101:          0          0          0          0       GIC  vi.0
103:          0          0          0          0       GIC  tegra-isp-isr
104:          0          0          0          0       GIC  tegra-isp-isr
105:          7          0          0          0       GIC  tegradc.0
106:      11249          0          0          0       GIC  tegradc.1
109:          0          0          0          0       GIC  mc_status
113:        552          0          0          0       GIC  snd_hda_intel
116:          0          0          0          0       GIC  tegra12-i2c.1
118:          0          0          0          0       GIC  as3722
122:        492          0          0          0       GIC  serial
124:          0          0          0          0       GIC  tegra12-i2c.2
125:          0          0          0          0       GIC  spi-tegra114.3
129:      28356          0          0          0       GIC  ehci_hcd:usb2
130:          2          0          0          0       GIC  PCIE
131:      14299          0          0          0       GIC  PCIe-MSI
136:          0          0          0          0       GIC  apbdma.0
137:          0          0          0          0       GIC  apbdma.1
138:          0          0          0          0       GIC  apbdma.2
139:          0          0          0          0       GIC  apbdma.3
140:          0          0          0          0       GIC  apbdma.4
141:        356          0          0          0       GIC  apbdma.5
142:        336          0          0          0       GIC  apbdma.6
143:          0          0          0          0       GIC  apbdma.7
144:          0          0          0          0       GIC  apbdma.8
145:          0          0          0          0       GIC  apbdma.9
146:          0          0          0          0       GIC  apbdma.10
147:          0          0          0          0       GIC  apbdma.11
148:          0          0          0          0       GIC  apbdma.12
149:          0          0          0          0       GIC  apbdma.13
150:          0          0          0          0       GIC  apbdma.14
151:          0          0          0          0       GIC  apbdma.15
152:        136          0          0          0       GIC  tegra12-i2c.3
153:          0          0          0          0       GIC  tmr_lp2wake_cpu2
159:          0          0          0          0       GIC  hier_ictlr_irq
160:          0          0          0          0       GIC  apbdma.16
161:          0          0          0          0       GIC  apbdma.17
162:          0          0          0          0       GIC  apbdma.18
163:          0          0          0          0       GIC  apbdma.19
164:          0          0          0          0       GIC  apbdma.20
165:          0          0          0          0       GIC  apbdma.21
166:          0          0          0          0       GIC  apbdma.22
167:          0          0          0          0       GIC  apbdma.23
168:          0          0          0          0       GIC  apbdma.24
169:          0          0          0          0       GIC  apbdma.25
170:          0          0          0          0       GIC  apbdma.26
171:          0          0          0          0       GIC  apbdma.27
172:          0          0          0          0       GIC  apbdma.28
173:          0          0          0          0       GIC  apbdma.29
174:          0          0          0          0       GIC  apbdma.30
175:          0          0          0          0       GIC  apbdma.31
183:          0          0          0          0       GIC  tmr_lp2wake_cpu3
189:      11588          0          0          0       GIC  gk20a_stall
190:          0          0          0          0       GIC  gk20a_nonstall
247:          1          0          0          0      GPIO  nct72
288:          2          0          0          0      GPIO  tegradc.1
305:          0          0          0          0      GPIO  Power
320:          0          0          0          0      GPIO  headphone detect
347:          0          0          0          0      GPIO  mmc1
449:          0          0          0          0    as3722  rtc-alarm
464:          0          0          0          0    as3722  as3722-adc-extcon.2
640:          0          0          0          0  PCIe-MSI  PCIe PME, aerdrv
641:      14301          0          0          0  PCIe-MSI  eth0
IPI0:          0          0          0          0  CPU wakeup interrupts
IPI1:          0          0          0          0  Timer broadcast interrupts
IPI2:      36627      67025      33896      20853  Rescheduling interrupts
IPI3:         59        173        187        178  Function call interrupts
IPI4:          1        545        239        200  Single function call interrupts
IPI5:          0          0          0          0  CPU stop interrupts
IPI6:          0          0          0          0  CPU backtrace
Err:          0

I don’t know about the IRQ balancing question; it may or may not apply to this architecture. Something to note: on a system truly capable of handling hardware IRQs on all CPU cores (such as an IO-APIC-enabled Intel x86, or a desktop AMD CPU), driver interrupts for each individual piece of hardware (NIC, drive controllers, video, etc.) can occur on any CPU core without special software support (though a certain amount of sticking to a single core has a caching advantage when the cache is already loaded). There may be some scheme where software assigns a specific driver IRQ to a specific core, but this is more of a bandage compared to hardware automatically making any core capable of handling an IRQ based on priority and core availability. It’s quite possible that a software scheme must still see the IRQ on CPU0 and then hand it off; a hardware scheme should be able to handle hardware IRQs simultaneously on any core, without the stall of one core handing off to another.
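For what it’s worth, the software scheme mentioned above is exposed on Linux as /proc/irq/&lt;N&gt;/smp_affinity, a CPU bitmask per IRQ. A small sketch (the IRQ number 51 is taken from the eth0 line of the Radxa listing above; substitute a number from your own /proc/interrupts; whether the write sticks depends on whether the interrupt controller can actually route that IRQ to other cores):

```shell
#!/bin/sh
# Read, and attempt to set, the CPU affinity mask of one IRQ.
irq=51
if [ -r "/proc/irq/$irq/smp_affinity" ]; then
    echo "current mask for irq $irq: $(cat /proc/irq/$irq/smp_affinity)"
    # "f" = binary 1111 = allow CPU0-3 to service this IRQ.
    echo f 2>/dev/null > "/proc/irq/$irq/smp_affinity" \
        || echo "mask for irq $irq is not writable here"
else
    echo "irq $irq is not present on this machine"
fi
```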

I still do not know the answer as to whether the Cortex-A15 requires the equivalent of an IO-APIC or not; without knowing that, it isn’t possible to say what is needed to distribute hardware IRQs.

The difference between 0.02ms and 0.2ms is because of the CPU clock. If you lock the clocks as I mentioned in my previous post, you don’t see the variation in ping times.

I linked the performance wiki page in my previous post. The page lists one way to run commands on boot: editing /etc/rc.local. Just make sure it doesn’t exit on the first error, as setting an already-online core to online will cause an error.
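A minimal sketch of such an /etc/rc.local body (Tegra sysfs paths as listed on the wiki page; every write is followed by `|| true` so an error on an already-online core cannot abort the script):

```shell
# Write the candidate rc.local body to a temporary file and run it once;
# on a machine without the Tegra sysfs nodes the guarded writes simply
# do nothing, and the script still exits 0, which is the property that
# matters for /etc/rc.local.
cat > /tmp/rc.local.sketch <<'EOF'
#!/bin/sh
# Disable tegra_cpuquiet and bring all cores online at boot.
echo 0 2>/dev/null > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable || true
for cpu in /sys/devices/system/cpu/cpu[1-9]*/online; do
    echo 1 2>/dev/null > "$cpu" || true
done
exit 0
EOF
sh /tmp/rc.local.sketch && echo "sketch exited cleanly"
```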

So the ping times do not stay the same, even if I manually set the frequencies. With no major processes running I get ping times of ~0.153ms, and once I run a process like roscore (the master communicator for the ROS node system) they suddenly drop to ~0.025ms.

This, however, still does not solve the main issue. I framed this thread as an issue of ping times because ROS runs through IPC sockets, but the actual issue is that I am sending/receiving commands over USB (serial, in actuality, through /dev/ttyACMx) to my robot’s motor controller, and for some reason, when I send a consistent stream of motor commands to the port, there are random stutters in the movement of the motors. I tested the same code I use to send commands through the serial port on the Radxa Rock as well as an x86 computer and could never reproduce this error on them; it only happens on the Jetson.

Now I’m a bit lost as to what the actual issue is, since the ping times are low now but I still can’t send commands consistently at, say, 10Hz.

The only thing I can think of now is how linuxdev mentioned that all the hardware interrupts run on one core. I’m not sure if that has anything to do with my problem, though. As I said before, on the Radxa Rock the interrupts in /proc/interrupts are distributed across all 4 cores.

Anybody have any input on why I cannot send commands through a serial port consistently?

Unless there is buffering, USB itself might stutter. If there is a USB hub between the Jetson and the device, that could add to the issue. And of course USB is something which also requires CPU0 to service interrupts.

I took out the USB hub a while ago, thinking it may have been the issue, so this problem arises even with a direct connection between the motor controller and the Jetson’s USB port.

If you have a particular process running for this communication, you might renice it. The default nice is 0; setting it to -1 might improve things.

Won’t that process have a different PID each time it runs? That seems like a temporary patch that could break easily.

Yes, that is correct if using something like top or renice to accomplish this. You can also start your program via the “nice” command if you are starting it as root.
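A small sketch of both approaches (the process name roscore is only an example; negative niceness needs root, so the runnable line uses a positive value, which any user may set):

```shell
# With no command, "nice" prints the current niceness, so running
# "nice" under "nice -n 5" shows the adjusted value.
nice -n 5 nice

# Starting a program at higher priority (root needed for negative values):
#   sudo nice -n -1 ./serial_sender
# Renicing an already-running process by name (roscore is an example):
#   sudo renice -n -1 -p "$(pgrep -x roscore)"
```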