I was told by NVIDIA support to direct my question here.
uname -a
Linux tegra-ubuntu 3.10.40-ged4f697 #1 SMP PREEMPT Mon Dec 1 14:34:46 PST 2014 armv7l armv7l armv7l GNU/Linux
We are using ROS Indigo on this board for our robotics project. We noticed that on this kernel version we are having some networking problems.
The previous release, running the same unmodified code, works fine. The last release we tried before this was 19.2; after upgrading to 21.2.1 we are having issues. Our program communicates over localhost/TCP, and we noticed delays when sending messages from one program to the other (from publisher to subscriber, all on the Jetson).
Also, looking further, we noticed that ping times are quite large on this board compared to other boards.
Here is the output from the Jetson running 21.2.1. As you can see, the first reply is always very quick whenever ping is restarted:
ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.213 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=1.53 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=1.20 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=1.15 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=1.13 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=1.16 ms
64 bytes from localhost (127.0.0.1): icmp_seq=7 ttl=64 time=1.46 ms
64 bytes from localhost (127.0.0.1): icmp_seq=8 ttl=64 time=3.82 ms
64 bytes from localhost (127.0.0.1): icmp_seq=9 ttl=64 time=1.23 ms
64 bytes from localhost (127.0.0.1): icmp_seq=10 ttl=64 time=2.96 ms
64 bytes from localhost (127.0.0.1): icmp_seq=11 ttl=64 time=1.14 ms
64 bytes from localhost (127.0.0.1): icmp_seq=12 ttl=64 time=3.76 ms
Here is the output from another ARM board (not NVIDIA):
ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.077 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.082 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.070 ms
And from another x86 machine:
ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.048 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.054 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.049 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.051 ms
ubuntu@tegra-ubuntu:~$ uname -a
Linux tegra-ubuntu 3.10.40-grinch-21.2.1 #8 SMP PREEMPT Sat Dec 13 11:27:12 UTC 2014 armv7l armv7l armv7l GNU/Linux
ubuntu@tegra-ubuntu:~$ ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=4 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=5 ttl=64 time=0.072 ms
64 bytes from localhost (127.0.0.1): icmp_seq=6 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=7 ttl=64 time=0.074 ms
64 bytes from localhost (127.0.0.1): icmp_seq=8 ttl=64 time=0.073 ms
64 bytes from localhost (127.0.0.1): icmp_seq=9 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=10 ttl=64 time=0.071 ms
64 bytes from localhost (127.0.0.1): icmp_seq=11 ttl=64 time=0.067 ms
64 bytes from localhost (127.0.0.1): icmp_seq=12 ttl=64 time=0.066 ms
It’s a known issue under Linux for the R21.x kernel to have problems at full gigabit speeds (it seems to be an issue of the NIC driver interacting with the scheduler…which mysteriously goes away a few kernel versions later). This is not an issue under R19.x. I have, however, seen similar ping times under R19.x.
Those ping times are not necessarily “bad”, but they are unexpected and sometimes an issue where performance is required. Finding out more would require profiling, which is not “quick”…you would need to rebuild the kernel for this, and the issue might still be subtle to pinpoint (I do think this is an issue which needs to be addressed).
When I looked at this earlier (I’m not in a position to profile this) I made some observations as to possibilities. The first observation is that there is a dependence upon a hardware interrupt to begin the driver’s handling of the ping…if hardware interrupt handler start is delayed then so is ping.
Looking at /proc/interrupts you will see only one Jetson CPU core handling hardware interrupts. Quite some time ago Intel format CPUs were being supported for SMP on x86 and the default was to support hardware IRQ only on CPU0. In order to spread IRQs (I’ll stop saying “hardware” but realize software interrupts are not included in the conversation) across all CPUs an IO-APIC is required for these Intel CPUs beyond CPU0 (AMD had a different x86 scheme and did not require the IO-APIC). Deactivating Intel’s IO-APIC (or not having one) meant hardware being handled less often and less responsively via just one core.
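As a quick way to see this (a rough sketch; the name the NIC's IRQ shows up under in /proc/interrupts is an assumption and differs by platform):

cat /proc/interrupts | head -n 5                 # one column of counters per CPU core
grep -i eth /proc/interrupts                     # "eth" is a guess; match whatever the NIC's IRQ is actually called
watch -n 1 'grep -i eth /proc/interrupts'        # watch which core's counter climbs while ping/iperf is running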
Looking at the Jetson, I’m not certain if the use of a single IRQ-handling core is because of the need for the equivalent of an IO-APIC, or if (like early AMD multi-core CPUs) it is only a software-fixable issue not requiring hardware modification. But to make a long story short, when only a single core handles hardware IRQs, driver latencies go up much faster as IRQ load increases than they do when many CPU cores can service IRQs. It’s called interrupt starvation when a hardware handler must wait because other handlers hold control.
So if profiling is done, I would look closely at the time it takes the IRQ handler to reach the driver code…versus whether the delays are within the driver itself. On R21.x you also have the complication of what I think is a flaw in the scheduler interactions (mysteriously fixed a few kernel versions later). Profiling really needs to be done both with a kernel version beyond the R21.x kernels, where scheduling and the NIC driver have no issue, and with the R19.x kernel.
For the other non-Jetson ARM boards you used, how many are multi-core? Do the multi-core versions show single core CPU0 in /proc/interrupts?
In a robotics project it might be possible/meaningful to set the governor (or frequencies) manually for the idle/normal cases, to get power consumption down in idle mode and responsiveness to max in normal mode. Running with max clocks doesn’t increase power consumption drastically if the system is otherwise idle (depends on the goals, though); I measured a 0.4 W increase.
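Something like this as a sketch (the sysfs paths are the standard cpufreq ones; the frequency value below is only an example, so check the list first):

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor      # as root: stop the governor from scaling down
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies        # list the clocks the driver offers
echo 2320500 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq          # example value; pin the minimum to one of the listed frequencies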
@linuxdev Does irqbalance need to be installed? I’m pretty sure it’s for Intel chips only… but just wondering if I may need it, because it wasn’t installed.
When I looked at /proc/interrupts I saw that there was only one core running, with all the interrupts on that core. After following @kulve’s link for maximizing CPU performance, I enabled all the cores and then looked again to see that all four cores were now present, but only CPU0 was taking care of all the interrupts and no interrupts were scheduled on CPU1-3. Turning on all 4 cores manually did, however, improve the ping time to ~0.2 ms (without any big processes running), and then once I ran a ROS node the localhost ping went down to ~0.02 ms. If I exited the ROS node, the ping would stay the same at ~0.02 ms.
Enabling the cores manually only works until reboot, so would you or anyone know how I could permanently set all CPUs online and disable tegra_cpuquiet?
I also looked at the other ARM board I have (Radxa Rock Pro), and when I looked at the /proc/interrupts file, the interrupts were distributed amongst all the cores.
I don’t know about the IRQ balancing question…it may or may not apply to this architecture. Something to note on a system which is truly capable of all CPU cores handling hardware IRQ (such as an IO-APIC enabled Intel x86 or desktop AMD CPU) is that driver interrupts for each individual piece of hardware (such as NIC, drive controllers, video, etc) can occur on any CPU core (though a certain amount of sticking to a single core has an advantage so far as caching is concerned in a case where cache has already loaded) without the same level of software support as needed to fake this. There may be some scheme where software assigns a specific driver IRQ to a specific core, but this is more of a bandage compared to hardware automatically making any core capable of handling an IRQ based on priority and core availability. It’s quite possible that a software-scheme must still see the IRQ on CPU0 and then hand it off…a hardware scheme should be able to simultaneously handle hardware IRQs on any core without stalling from one core handing off to another core.
I still do not know the answer as to whether the Cortex-A15 requires the equivalent of an IO-APIC or not…without knowing that, it isn’t possible to say what is needed to distribute hardware IRQs.
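For what it’s worth, the software-side knob for steering a specific IRQ to a specific core is the per-IRQ affinity mask under /proc/irq. A sketch (the IRQ number 132 is purely a made-up example; look up the NIC’s real number in /proc/interrupts first, and on some SoCs the kernel simply ignores the mask):

grep -i eth /proc/interrupts            # find the NIC's IRQ number (the name is platform-specific)
cat /proc/irq/132/smp_affinity          # current CPU bitmask for that IRQ (132 is a placeholder)
echo 2 > /proc/irq/132/smp_affinity     # as root: request CPU1 (bitmask 0x2) for that IRQ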
The difference between 0.02ms and 0.2ms is because of the CPU clock. If you lock the clocks as I mentioned in my previous post, you don’t see the variation in ping times.
I linked the performance wiki page in my previous post. The page lists one way to run commands on boot: edit /etc/rc.local. Just make sure it doesn’t exit on the first error, as setting an already-online core to online will cause an error.
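A minimal /etc/rc.local sketch along those lines (the cpuquiet sysfs node name is an assumption and may differ between releases):

#!/bin/sh
# Bring all cores online at boot and keep cpuquiet from hot-unplugging them again.
echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable 2>/dev/null || true   # path may differ; 0 disables cpuquiet
for cpu in 1 2 3; do
    echo 1 > /sys/devices/system/cpu/cpu$cpu/online 2>/dev/null || true                # '|| true' so an already-online core cannot abort the script
done
exit 0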
So the ping times do not stay the same, even if I manually set the frequencies. If I have no major processes running, I get ping times of ~0.153 ms, and then once I run a process like roscore, which is the master communicator for the ROS node system, they suddenly drop to ~0.025 ms.
This, however, still does not solve the main issue. I framed this thread as an issue of ping times because ROS runs through IPC sockets. But the actual issue is that I am sending/receiving commands over USB (serial, in actuality, through /dev/ttyACMx) to my robot’s motor controller, and for some reason when I am sending a consistent stream of motor commands to the port, there are random stutters in the movement of the motors. I tested this with the same code I use to send commands through the serial port on the Radxa Rock as well as an x86 computer, and could never reproduce this error on them; it only happens on the Jetson.
Now I’m a bit lost on what the actual issue is, since the ping times are low now, but I still can’t send commands consistently at, say, 10 Hz.
The only thing I can think of now, though, is how linuxdev mentioned that all the hardware interrupts run on one core. I’m not sure if that has anything to do with my problem. As I said before, on the Radxa Rock under /proc/interrupts the interrupts are distributed across all 4 cores.
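One way to narrow it down might be to take ROS and the motor protocol out of the picture and just time raw writes to the port. A rough sketch (the device path, baud rate, and 'CMD' payload are placeholders for whatever your controller actually expects):

stty -F /dev/ttyACM0 115200 raw -echo        # baud rate is a guess; match the motor controller's setting
for i in $(seq 1 100); do
    date +%s.%N                              # timestamp each iteration; gaps much larger than 0.1 s show the stutter
    printf 'CMD\r' > /dev/ttyACM0            # 'CMD' stands in for a real (harmless) command
    sleep 0.1                                # target rate: 10 Hz
done > /tmp/serial_timing.log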
Anybody have any input on why I cannot send commands through a serial port consistently?
Unless there is buffering, USB itself might stutter. If there is a USB hub between the Jetson and the device, this could add to the issue. And of course USB is also something which requires CPU0 to service its interrupts.
I took out the USB hub a while ago thinking this may have been the issue, so this problem is arising with a direct connection between the motor controller and the Jetson’s USB port.
Yes, that is correct if using something like top or renice to accomplish this. You can also start your program via the “nice” command if you are starting it as root.
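For example (a sketch; 'my_node' is a placeholder for your actual program, and a negative niceness needs root):

sudo nice -n -10 ./my_node       # start the program at a higher scheduling priority
sudo renice -n -10 -p 1234       # or raise the priority of an already-running process (1234 is a placeholder PID)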