Hi everyone,
I’m currently testing the latency performance on my NVIDIA Jetson Orin Nano using cyclictest, and I’m seeing an interesting anomaly when measuring latency across different CPU cores.
I executed the following cyclictest command, changing the affinity to different pairings of cores:
$ cyclictest --mlockall --threads --mainaffinity 0 --affinity 4,5 --priority 90 --distance 0 --duration 300 --histogram=1000 --histfile=cyclictest.txt
When inspecting the output, I observe that cores 0, 1, 2, and 3 consistently show lower latency values (around 25 us), whereas cores 4 and 5 exhibit higher latency (around 40 us).
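For anyone who wants to compare numbers rather than plots, here is a minimal sketch of how the histogram file could be summarised per thread. It assumes the usual cyclictest histogram layout: comment lines start with `#`, and every data row is a latency bucket in microseconds followed by one sample count per measurement thread. The function name `summarize_histogram` is made up for illustration. Samples above the histogram range (overflows) are not in the data rows, so the reported maximum is capped at the last bucket.

```python
from typing import Dict, List


def summarize_histogram(text: str) -> List[Dict[str, float]]:
    """Return sample count, mean and max latency (us) for each thread column.

    Assumes rows of the form "<latency_us> <count_thread0> <count_thread1> ...",
    with '#' marking comment and summary lines.
    """
    per_thread: List[Dict[int, int]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header and summary lines
        fields = line.split()
        latency_us = int(fields[0])
        counts = [int(f) for f in fields[1:]]
        # Grow the per-thread list the first time we see each column.
        while len(per_thread) < len(counts):
            per_thread.append({})
        for thread, count in enumerate(counts):
            if count:
                per_thread[thread][latency_us] = count

    summary = []
    for hist in per_thread:
        total = sum(hist.values())
        mean = sum(lat * n for lat, n in hist.items()) / total if total else 0.0
        summary.append({
            "samples": total,
            "mean_us": mean,
            "max_us": max(hist) if hist else 0,
        })
    return summary
```

Reading `cyclictest.txt` into a string and passing it to `summarize_histogram` gives one entry per thread, in the same order as the cores listed in `--affinity`.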
Does anyone have an explanation for this? If I’m not mistaken, the 6-core ARM Cortex-A78AE CPU is built as a combination of a quad-core and a dual-core cluster. Is this correct, and could this cause the difference in latency? I would think cores 0, 1, 2 and 3 are part of the quad-core cluster, which has a different power policy than the dual-core cluster with cores 4 and 5. This is just an assumption though.
If it helps, this is the visualisation of my cyclictest data if I place affinity on cores 1 to 5.
Thanks in advance!
Hi,
Please share the full steps you used for the profiling. We will set up an Orin Nano developer kit with JetPack 6.2 and check.
Hi,
To achieve this plot, a Linux system with the PREEMPT_RT patch is needed. In our tests, we made our own image with the Jetson BSP and kernel 6.12. In general, we are just interested in comparing results with other people who have tried the PREEMPT_RT patch on the Jetson Orin Nano, or in hearing some opinions on these results.
Looking at the technical reference manual of the Jetson platforms, I would argue that the dual-core cluster of the Jetson Orin Nano should perform better, because it has its own (L1, L2 and L3) cache and seems to operate at the maximum frequency of 1.5 GHz all the time. Any thoughts on this?
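One way to check the frequency assumption is to read the cpufreq entries in sysfs while the test is running. Below is a rough sketch, assuming the standard Linux cpufreq files (`related_cpus`, `scaling_governor`, `scaling_cur_freq`, `cpuinfo_max_freq`) are exposed under `/sys/devices/system/cpu/cpuN/cpufreq` on this image; I have not verified that for the custom 6.12 build. The function name `read_cpufreq` is hypothetical.

```python
from pathlib import Path
from typing import Dict

# Standard cpufreq attributes; frequencies are reported in kHz.
CPUFREQ_FILES = (
    "related_cpus",      # CPUs sharing one frequency policy (often one cluster)
    "scaling_governor",  # active governor, e.g. "schedutil" or "performance"
    "scaling_cur_freq",  # current frequency in kHz
    "cpuinfo_max_freq",  # hardware maximum frequency in kHz
)


def read_cpufreq(sysfs_root: str = "/sys/devices/system/cpu") -> Dict[str, Dict[str, str]]:
    """Collect cpufreq attributes for every cpuN directory that has them."""
    result: Dict[str, Dict[str, str]] = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        freq_dir = cpu_dir / "cpufreq"
        if not freq_dir.is_dir():
            continue  # CPU offline or cpufreq not available for it
        info = {}
        for name in CPUFREQ_FILES:
            path = freq_dir / name
            info[name] = path.read_text().strip() if path.exists() else "n/a"
        result[cpu_dir.name] = info
    return result
```

Calling `read_cpufreq()` a few times during a cyclictest run and comparing `scaling_cur_freq` with `cpuinfo_max_freq` for cpu4 and cpu5 would show whether those cores really stay at the maximum frequency, and `related_cpus` should show which cores share a policy.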
Hi,
Would like to confirm the steps, so you run the command:
$ cyclictest --mlockall --threads --mainaffinity 0 --affinity 4,5 --priority 90 --distance 0 --duration 300 --histogram=1000 --histfile=cyclictest.txt
And the generated cyclictest.txt shows that cores 4 and 5 have higher latency. Is my understanding correct?
Also, do you observe it on the default JetPack 6.2, which uses kernel 5.15?
You will always have this on any system running the Linux kernel. All the RT patch does is make your outliers and “latency” more predictable. The downside is that when you constrain the outliers, you will bottleneck other services and create issues with your desktop management and sometimes the graphics display.
If you need a very deterministic system, you will need to drop the kernel and use something like the Raspberry Pi Pico or similar.
That is true, but we also observe these discrepancies between cores 0, 1, 2, 3 (cluster 1) and cores 4, 5 (cluster 2) if we isolate cores 4 and 5 and test only those. In that setup, the cores in the first cluster handle interrupts and other services, while the second cluster takes the real-time work. The second cluster seems to run at the maximum frequency by default and is never put to sleep (developers of the nvpmodel/bpmp firmware, correct me on this?), so I would expect those cores to be significantly faster than the others, and that’s where the strange things start to happen (see plot above).
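To double-check that interrupts really land on the first cluster, the per-CPU counters in `/proc/interrupts` can be summed up before and after a run. This is only a sketch: it assumes the usual layout with a header row of CPU names followed by one row per interrupt source, and it skips rows that do not carry a count for every listed CPU (such as the error counter at the end). The helper name `interrupts_per_cpu` is made up.

```python
from typing import Dict


def interrupts_per_cpu(text: str) -> Dict[str, int]:
    """Sum the interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue  # not an interrupt row
        counts = fields[1:1 + len(cpus)]
        # Only use rows that have one numeric counter per listed CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, value in zip(cpus, counts):
            totals[cpu] += int(value)
    return totals
```

Taking the difference between two snapshots (one before and one after the cyclictest run) shows how many interrupts each core served during the measurement; on properly isolated cores the delta for CPU4 and CPU5 should be small compared to the others.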
Yes, so much depends on code that is extremely complex. The bad part is that all you will do is chase your tail, patching this and patching that. I would look for outliers in the overall system and make decisions based on that. It comes down to just how deterministic your design must be.