Threads running on an asymmetric system: Parker SoC has two Denver cores and four ARM Cortex-A57 cores

I’ve written a performance-critical piece of code that relies on ARM NEON instructions. The target hardware is Magic Leap (ML1), which is actually a Parker SoC. I have to thread the application to maintain performance, but the timing varies depending on which worker thread in the Unity game engine gets the job. When assigned to thread 2, my code runs in 29-33 ms. When assigned to thread 1, it runs in 57 ms.

My theory is that thread 2 happens to be on an ARM core with NEON and thread 1 is on a Denver core without NEON. What is expected to happen when code that is compiled with NEON runs on cores that do not have NEON? The code does run so there must be emulation going on. Is it supposed to be a feature of an OS that somehow puts NEON code on cores that support NEON or is that something a developer has to schedule manually?

I don’t know whether the Denver cores lack NEON, but the ARMv8-A spec makes NEON mandatory, so I doubt this is the issue. NEON and floating point were optional extensions in the older 32-bit ARMv7-A, but they are not optional in 64-bit ARM. What you may well be running into instead is cache behavior. If your code migrates to a different core, it will most likely take cache misses the first time it runs there, whereas staying on the same core tends to produce cache hits.

If other processes are running on a core and they fill the cache with their own data, then even code that stays on a single core can still take cache misses once its data has been evicted.

It might get complicated, but you could consider assigning core affinity, along with denying a core to other processes (never do this with CPU0, the first core…it handles hardware interrupts).
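A minimal sketch of the affinity half of that suggestion, written in Python on Linux for brevity (a native plugin would more likely call `sched_setaffinity` or `pthread_setaffinity_np` from C). The helper name `pin_away_from_cpu0` is hypothetical. It only restricts the caller; it does not keep other processes off the chosen cores, which would need cgroups/cpusets or a similar mechanism.

```python
import os


def pin_away_from_cpu0(pid=0):
    """Restrict `pid` (0 = the caller) to its currently allowed CPUs minus CPU 0.

    Returns the affinity set in effect afterwards. If CPU 0 is the only
    CPU available, the affinity is left unchanged.
    """
    allowed = os.sched_getaffinity(pid)
    target = allowed - {0}          # leave CPU 0 alone, as advised above
    if not target:
        return allowed              # nothing else to run on
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    print("now allowed on CPUs:", sorted(pin_away_from_cpu0()))
```

`os.sched_getaffinity` and `os.sched_setaffinity` are only available on platforms that provide the underlying system calls, such as Linux.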

About cgroups and core affinity:

Note that if you have gone far enough to set up affinity you can also set up priority.

This wikipedia article says that NEON wasn’t in Denver, but it was added in Carmel later.

But thanks because you’ve given me plenty to look into. I will look into core affinity.

NEON/ASIMD is supported on all cores in Parker.
You can use the ‘taskset’ command to set the affinity of the process to a specific CPU and then run the test.
If the problem persists, please share the results on both the ARM and Denver cores, along with details about which NEON instructions are used.
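One way to collect per-core numbers without launching the test once per `taskset` mask is to pin the caller to each CPU in turn and time the same workload. This is only a sketch, assuming Linux and Python. `time_on_each_cpu` and `sample_work` are hypothetical names, and `sample_work` is a placeholder, not the NEON code from the question.

```python
import os
import time


def time_on_each_cpu(work, repeats=5):
    """Run `work()` pinned to each allowed CPU in turn.

    Returns {cpu_index: best_time_in_seconds}. The original affinity
    mask is restored afterwards, even if `work` raises.
    """
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                work()
                best = min(best, time.perf_counter() - start)
            results[cpu] = best
    finally:
        os.sched_setaffinity(0, original)
    return results


def sample_work():
    # Placeholder workload; substitute a call into the code under test.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    for cpu, seconds in time_on_each_cpu(sample_work).items():
        print(f"cpu {cpu}: {seconds * 1e3:.2f} ms")
```

Taking the best of several repeats reduces the effect of the cold-cache first run on a newly selected core.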

# cat /proc/cpuinfo | egrep 'processor|implementer|Features'
processor       : 0
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
processor       : 1
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x4e
processor       : 2
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x4e
processor       : 3
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
processor       : 4
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
processor       : 5
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
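
In this output every processor lists `asimd`, and the implementer code tells the two core types apart: 0x41 is ARM's implementer code and 0x4e is NVIDIA's, so processors 1 and 2 would be the Denver cores here. A small sketch that groups processors this way is shown below. The function name `cores_by_implementer` and the `SAMPLE` text (copied from the output above) are illustrative only.

```python
# Implementer codes as reported in /proc/cpuinfo on 64-bit ARM.
IMPLEMENTERS = {0x41: "ARM", 0x4E: "NVIDIA"}


def cores_by_implementer(cpuinfo_text):
    """Map implementer name -> sorted list of processor indices."""
    groups = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU implementer" and current is not None:
            code = int(value, 16)
            name = IMPLEMENTERS.get(code, hex(code))
            groups.setdefault(name, []).append(current)
    return {name: sorted(cpus) for name, cpus in groups.items()}


SAMPLE = """\
processor       : 0
CPU implementer : 0x41
processor       : 1
CPU implementer : 0x4e
processor       : 2
CPU implementer : 0x4e
processor       : 3
CPU implementer : 0x41
processor       : 4
CPU implementer : 0x41
processor       : 5
CPU implementer : 0x41
"""

if __name__ == "__main__":
    print(cores_by_implementer(SAMPLE))
    # {'ARM': [0, 3, 4, 5], 'NVIDIA': [1, 2]}
```

On the target itself the same function could be fed the contents of `/proc/cpuinfo` to choose an affinity mask per core type.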

Note about binding tasks to the respective CPUs for better performance:
Smaller tasks perform better on the ARM cores, and bigger tasks perform better on the Denver cores.