Very high core-to-core CAS latency

While running the core-to-core latency benchmark, we found that the Jetson Xavier exhibits very high core-to-core latency (~1000 ns).

The test code is core-to-core-latency/src/bench/cas.rs at main · nviennot/core-to-core-latency · GitHub, where two threads essentially spin on Compare-and-Swap (CAS) in a loop until it succeeds and measure the time.
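For context, here is a minimal std-only sketch of that kind of CAS ping-pong measurement. It is not the repo’s actual cas.rs, and it leaves out the per-core thread pinning that the real benchmark does:

```rust
// Rough sketch of a CAS ping-pong latency measurement (not the exact cas.rs
// from the repo; thread-to-core pinning is omitted for brevity, while the
// real benchmark pins each thread to a specific core).
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const PING: u32 = 0;
const PONG: u32 = 1;
const ROUND_TRIPS: u32 = 100_000;

fn main() {
    let flag = Arc::new(AtomicU32::new(PING));

    // "Pong" thread: waits until the other side has set PING, flips it to PONG.
    let pong_flag = Arc::clone(&flag);
    let pong = thread::spawn(move || {
        for _ in 0..ROUND_TRIPS {
            while pong_flag
                .compare_exchange(PING, PONG, Ordering::AcqRel, Ordering::Acquire)
                .is_err()
            {}
        }
    });

    // "Ping" side (main thread): waits for PONG, flips it back to PING.
    let start = Instant::now();
    for _ in 0..ROUND_TRIPS {
        while flag
            .compare_exchange(PONG, PING, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {}
    }
    let elapsed = start.elapsed();
    pong.join().unwrap();

    // Each round trip bounces the cache line twice, so halve it to get the
    // one-way core-to-core figure the benchmark reports.
    let one_way_ns = elapsed.as_nanos() as f64 / (2.0 * f64::from(ROUND_TRIPS));
    println!("~{one_way_ns:.0} ns per core-to-core handoff");
}
```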

This is running on NVIDIA’s devkit with vanilla JetPack and the MAXN power setting. We tried different JetPack versions (35.5.0, 35.4.1, 35.2.1) as well as different Jetson Xavier modules, but they all exhibit similar results, as shown below.

However, we don’t see these bad results on the Jetson Orin.
[image: core-to-core latency benchmark results]

Any idea why this could happen?

Hi,
Do you compare Xavier and Orin on the same JetPack 5.1.3? If the software version is identical, the deviation may come from hardware capability.

For Orin we only tried 5.1.2; since it gives good results, we didn’t try other versions. For Xavier we have tried 5.1.2, 5.1.3, and 5.1, and they all give bad results.

1000 ns+ still sounds like a very large number for a hardware limitation though, so I suspect this could be something software- or kernel-related.
Based on the results at GitHub - nviennot/core-to-core-latency: Measures the latency between CPU cores, even a CPU from 2003 (IBM PowerPC 970, 1.8 GHz, 2 cores, 2003-Q2) is 600 ns.

I am curious, are all cores doing this simultaneously? Or are you testing a pair of cores at any given time, and then moving on to other core pairs?

This is testing pair by pair, e.g. between cores 1 and 2, 1 and 3, …, 1 and 8, and then 2 and 1, 2 and 3, etc.

Do beware that the first CPU (I call it CPU0, but in your chart it is CPU 1) is used for hardware interrupts. The other cores use soft IRQs. Anything that loads down the first core has a chance of slowing all of the other cores. In this case I don’t think it is an issue because you are not accessing hardware drivers (the lock spin doesn’t need the disk, it doesn’t need Ethernet, and so on). However, if for some reason not presented here the first core is under load, then it could have an effect on your test by making hardware drivers wait as well.

This isn’t exactly what you are asking, and isn’t really part of your test, but you might want to quickly examine “/proc/interrupts”. This is a list of hardware IRQs. You’ll notice that almost everything requires CPU0. Each core has its own timers, but in an incorrect test setup another core can at times end up depending on a hardware driver serviced by CPU0 (I don’t think your test would have this issue because you only work a pair of cores at a time).
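In case it’s convenient, here is a rough Rust snippet (just a sketch, not part of the benchmark) that totals the per-CPU columns of /proc/interrupts, so you can see how much of the interrupt load lands on CPU0 versus the other cores; simply looking at the file directly shows the same thing:

```rust
// Sum the per-CPU interrupt counts from /proc/interrupts (rough sketch).
use std::fs;

fn main() {
    let text = fs::read_to_string("/proc/interrupts").expect("read /proc/interrupts");
    let mut lines = text.lines();

    // The header row lists the CPU columns, e.g. "CPU0  CPU1  ...".
    let cpus: Vec<&str> = match lines.next() {
        Some(header) => header.split_whitespace().collect(),
        None => return,
    };
    let mut totals = vec![0u64; cpus.len()];

    for line in lines {
        // Each IRQ row looks like "<irq>:  <count per CPU> ...  <controller> <name>";
        // trailing non-numeric fields simply fail to parse and are skipped.
        let fields: Vec<&str> = line.split_whitespace().collect();
        for (i, total) in totals.iter_mut().enumerate() {
            if let Some(count) = fields.get(i + 1).and_then(|f| f.parse::<u64>().ok()) {
                *total += count;
            }
        }
    }

    for (cpu, total) in cpus.iter().zip(&totals) {
        println!("{cpu}: {total} interrupts handled");
    }
}
```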

I also just tried the RT kernel on Xavier, which improves the latency significantly. This makes me think something related to the regular kernel is causing the high delay.

It is likely related to the kernel. I doubt the hardware design differs enough between the Xavier and Orin modules to have such an effect on one but not the other. Understand, though, that the RT kernel does not magically reduce all latency. There is also software configuration of cgroups that determines how RT scheduling modifies operation timing. I’m the wrong guy to do it, but if you can see what kind of configuration differences there are between the slow and fast cases, it might offer a clue. For example, the file “/proc/cgroups” might offer clues if you compare the two kernels, go through each line item one at a time, and look up what configuration might change that line item.
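If it helps to automate that comparison, here is a rough sketch (the file names are just placeholders) that diffs two saved copies of /proc/cgroups, one captured under each kernel, and prints the line items that differ:

```rust
// Compare two saved snapshots of /proc/cgroups and print differing entries.
use std::collections::{BTreeMap, BTreeSet};
use std::env;
use std::fs;

// Parse a snapshot into subsys_name -> "hierarchy num_cgroups enabled".
fn load(path: &str) -> BTreeMap<String, String> {
    let text = fs::read_to_string(path).expect("read cgroups snapshot");
    text.lines()
        .filter(|l| !l.starts_with('#')) // skip the "#subsys_name ..." header
        .filter_map(|l| {
            let mut it = l.split_whitespace();
            let name = it.next()?.to_string();
            Some((name, it.collect::<Vec<_>>().join(" ")))
        })
        .collect()
}

fn main() {
    // Usage: cgroups-diff <snapshot-from-generic-kernel> <snapshot-from-rt-kernel>
    let args: Vec<String> = env::args().collect();
    let (a, b) = (load(&args[1]), load(&args[2]));

    let names: BTreeSet<&str> = a.keys().chain(b.keys()).map(|s| s.as_str()).collect();
    for name in names {
        let (va, vb) = (a.get(name), b.get(name));
        if va != vb {
            println!("{name}: {va:?} vs {vb:?}");
        }
    }
}
```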
