Unexplained CPU latency spikes

I’m evaluating Jetson Xavier for realtime control for robotics, and maximum latency at idle load is the first and foremost metric to measure.

The setup is CPU shielding instead of the PREEMPT RT patch, i.e. isolating a core from other user processes and kernel scheduling to run a single dedicated process. This is based on the NO_HZ_FULL patch introduced in kernel 3.10 and some followup work (https://lwn.net/Articles/744753/) that I backported from 4.19 to 4.9. The kernel is flashed with jetson-xavier-maxn. The kernel parameters include “isolcpus=nohz,domain,6-7 nohz_full=6-7 rcu_nocbs=6-7”, and after booting, the following steps are used to run a dedicated process on CPU 7.

nvpmodel -m 0
jetson_clocks
for D in $(ls /proc/irq); do
  echo 0 >/proc/irq/$D/smp_affinity_list
done
sysctl vm.stat_interval=120
sysctl -w kernel.watchdog=0
echo 1 >/sys/bus/workqueue/devices/writeback/cpumask
taskset -c 7 ./a.out

I have verified with ftrace that the process running on the isolated CPU is not interrupted by the kernel until it exists.

I was first testing Ethernet latency (eth IRQs were pinned to CPU 6) and it averaged at 40us after some tuning, but the latency sometimes spiked to more than 100us, and with the system under load sometimes it could be as high as more than 1000us. I checked with ftrace’s function_graph tracer and found the same kernel path could have very different latencies during the spike.

I ran another simplified test program to detect 10us+ latency between two timestamps in a tight loop:

#include <stdio.h>
#include <stdint.h>

__attribute__((always_inline))
static inline int64_t get_clock_ticks() {
	int64_t value;
	asm volatile("mrs %0, cntvct_el0" : "=r"(value));
	return value;
}

static int64_t get_clock_frequency() {
	int64_t value;
	asm volatile("mrs %0, cntfrq_el0" : "=r"(value));
	return value;
}

#define NS_PER_SEC (1000 * 1000 * 1000)
#define THRESHOLD_NS (10 * 1000)

int64_t ticks_per_sec;
int64_t ns_per_tick;

__attribute__((noinline))
static void print_latency(int64_t diff, int64_t now) {
	int64_t latency_ns = diff * ns_per_tick;
	int64_t time = now * ns_per_tick;
	fprintf(stderr, "Latency of %ld ns @ %ld.%09ld\n", latency_ns,
		time / NS_PER_SEC, time % NS_PER_SEC);
}

int main() {
	ticks_per_sec = get_clock_frequency();
	ns_per_tick = NS_PER_SEC / ticks_per_sec;
	int64_t threshold_ticks = THRESHOLD_NS / ns_per_tick;

	int64_t prev = get_clock_ticks();
	for (;;) {
		int64_t now = get_clock_ticks();
		int64_t diff = now - prev;
		prev = get_clock_ticks();
		if (diff > threshold_ticks) {
			print_latency(diff, now);
			prev = get_clock_ticks();
		}
	}
	return 0;
}

This program was getting ~27us latency spikes irregularly.

Latency of 105376 ns @ 3835.054541664
Latency of 11200 ns @ 3835.055359712
Latency of 17952 ns @ 3835.057610944
Latency of 52992 ns @ 3835.058487296
Latency of 32192 ns @ 3835.101519680
Latency of 30176 ns @ 3835.151701504
Latency of 27808 ns @ 3835.278135552
Latency of 26272 ns @ 3835.343975968
Latency of 29440 ns @ 3835.515327616
Latency of 27872 ns @ 3835.618776448
Latency of 27392 ns @ 3835.704467264
Latency of 27584 ns @ 3835.867465056
Latency of 26816 ns @ 3835.967778336
Latency of 27520 ns @ 3836.134952736
Latency of 27840 ns @ 3836.276010432
Latency of 27072 ns @ 3836.694932416
Latency of 26848 ns @ 3836.910163200
Latency of 27360 ns @ 3836.992716672
Latency of 27296 ns @ 3837.542239008
Latency of 29536 ns @ 3837.706278336
Latency of 28256 ns @ 3837.815995648
Latency of 28512 ns @ 3837.861990304
Latency of 27904 ns @ 3837.989465856
Latency of 27584 ns @ 3838.366612896
Latency of 26880 ns @ 3838.707202432
Latency of 27872 ns @ 3839.142848256
Latency of 28416 ns @ 3839.897066976
Latency of 27904 ns @ 3840.016187104
Latency of 27072 ns @ 3840.139482944
Latency of 26816 ns @ 3840.353663232
Latency of 27136 ns @ 3840.565756224
Latency of 27136 ns @ 3840.710993600
Latency of 27008 ns @ 3840.753850752
Latency of 27296 ns @ 3840.956540480
Latency of 27008 ns @ 3841.088195968
Latency of 27104 ns @ 3841.279396096
Latency of 26560 ns @ 3841.341053024
Latency of 29760 ns @ 3843.566000224
Latency of 26912 ns @ 3843.715414080
Latency of 26912 ns @ 3844.143750144
Latency of 27904 ns @ 3844.413297376
Latency of 27552 ns @ 3844.543907232
Latency of 26432 ns @ 3844.687052512
Latency of 27232 ns @ 3844.833331424
Latency of 26944 ns @ 3845.081985504
Latency of 27328 ns @ 3845.184390976
Latency of 26848 ns @ 3845.278436576
Latency of 26688 ns @ 3845.389194848
Latency of 26656 ns @ 3849.192484832
Latency of 28608 ns @ 3849.279221216
Latency of 27072 ns @ 3849.350284288
Latency of 27264 ns @ 3849.721166144
Latency of 27392 ns @ 3849.956239296
Latency of 27072 ns @ 3850.182955712
Latency of 27712 ns @ 3850.434746688
Latency of 27424 ns @ 3850.722051264
Latency of 28640 ns @ 3850.921606432
Latency of 28128 ns @ 3851.282037536
Latency of 26912 ns @ 3851.351010880

The disassembly proves the timing loop is very simple.

0000000000000690 <main>:
...
 6cc:	d53be042 	mrs	x2, cntvct_el0
 6d0:	d53be041 	mrs	x1, cntvct_el0
 6d4:	cb020020 	sub	x0, x1, x2
 6d8:	d53be042 	mrs	x2, cntvct_el0
 6dc:	eb00027f 	cmp	x19, x0
 6e0:	54ffff8a 	b.ge	6d0 <main+0x40>  // b.tcont
 6e4:	94000045 	bl	7f8 <print_latency>
 6e8:	d53be042 	mrs	x2, cntvct_el0
 6ec:	17fffff9 	b	6d0 <main+0x40>

It seems to me the CPU is somehow not doing work during the spike (or doing some other work instead of running the code), and if the code path is longer the latency can multiply proportional to the length of the code path creating even larger latency spike. As you can understand, latency spikes destroy realtime safety guarantee.

Now, the question is what is the source of the latency spikes and what can I do about it?

I can’t answer, but you will want to also compile this for profiling and find out if there is a particular location in the code which is spiking, versus whether the location of code with the spike is random (not sure how easy that will be since profiling breaks it down by system calls and it would be nice to see assembler which isn’t going to happen). My thought is that if you can show the delay is at random points versus some specific location you can at least attribute it to some specific system call versus the system as a whole (e.g., a cache miss might behave differently than waiting on a resource…a resource wait is probably consistent, whereas a cache miss might be more random). Do beware that this is not hard realtime.

Please note that the latency is produced in a loop of 5 instructions (from 6d0 to 6e0 in the disassembly) without memory or cache access (or system call, I already noted that the kernel is not involved here). I don’t think profiling would help here.

mrs x, cntvct_el0 reads the Counter-timer Virtual Count register.

You are probably right, but despite your process not needing memory or cache, some other process could touch memory or cache and result in operations excluding your process during that time. I use the term “process” loosely here since some kernel code may execute which does not show up as a process or thread in any usual way. The direct goal is to know how to make the timing deterministic, but indirectly the goal is to find out what else is touching the core which occasionally prevents core access by your process. Sorry, unless you can profile in a JTAG debugger down to the assembly I don’t have much else to offer (and JTAG debuggers are not known to be available for Xavier). However, if NVIDIA reproduces this, and if they have a way to profile under JTAG, then it is a different story.

The setup of the kernel already prevents other processes or the kernel from interrupting the test program. That’s the whole point of CPU shielding.

Can any Nvidia developer offer a hint on what could be happening here, or how to approach this? Could it be that the firmware is stealing time from CPU to handle thermal events or something else under the hood? Or the CPU itself is withholding cycles to handle frequency changes?