What’s best tends to be a moving target, depending on the nature of your program and of everything else running.
Consider the case of a system with just one core, running just one process, but swapping among different threads. The cache has a limited size, and it is possible that each thread needs the same data, although you can only guarantee that some parts of the threads have common data. As soon as one of your own threads requires data not in cache, expect cached data to be evicted as the required data is loaded (invalidating cache for the previous thread…each thread thrashes the cache). Cache tends to be loaded in blocks or lines of addresses, so even the parts in common may get flushed and re-loaded between two threads of a single process. The nature of your own data use changes how cache hits work, e.g., if you run two threads accessing the same memory region (such as some shared memory operations) on the same core, you will get mostly cache hits (and if there is a miss, a single load satisfies both threads, halving the overhead)…if those two threads are assigned to separate cores and are changing that data, you are nearly guaranteed the cache mostly misses on both cores with maximum overhead, since each write on one core invalidates the line held by the other. So there is the question…are your own threads operating on different data? Are your own threads sharing data? This might mean two threads on one core is best, or two threads on separate cores is best.
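One way to experiment with this on Linux is the `taskset` command, which pins a process (and all of its threads) to chosen cores. A minimal sketch…the program name `./worker` and PID 4242 are hypothetical:

```shell
# Hypothetical two-threaded program "./worker" whose threads share a buffer.

# Pin the whole process (and thus both threads) to core 1, so both threads
# share that core's cache...mostly hits if they read the same data:
taskset -c 1 ./worker

# Allow cores 1 and 2, so the scheduler may spread the threads out...expect
# cache-line invalidation traffic if both threads write the same data:
taskset -c 1,2 ./worker

# Re-pin an already running process by PID:
taskset -p -c 1 4242
```

For per-thread (rather than per-process) placement from inside the program itself, `pthread_setaffinity_np()` offers the same control.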
Consider asymmetric multi-processing (“AMP”), versus the symmetric multi-processing (“SMP”) of Jetson and multi-core desktops. In asymmetric models, different cores can be dedicated to a purpose and are not available to everyone. The old VAX machines did this. It may make perfect sense to dedicate a core to a single function, such as disk drive access, when thousands of users all need data and disk access is a bottleneck. Dedicating all cache and buffer in one core to one function is as good as it gets for making use of cache. In a way, the rise of GPUs dedicated to video graphics does this…it is a specialty function where you know for certain the function is a bottleneck and dedicated hardware will improve performance. Even so, a single dedicated core could still have cache misses if more data is required than the cache can hold…so add more cache. The ultimate AMP cache is the video card, where you might have 2GB or 4GB of very fast RAM.
More cache is better. Or maybe not, if power consumption, physical size, or cost matters. Server CPUs often cost more, use more power, and have more cache than a typical desktop system. Server CPUs consume more power even at the same clock speed, and each clock speed increase requires disproportionately more power…if no cache were present the job would still get done, just more slowly…so you could choose to reduce power, cost, heat, and size by not including cache. Imagine if Jetson had 2GB of dedicated RAM similar to a video card, in addition to system RAM…the power consumption would go off the charts. Code would need to be re-written to weigh the power-efficient “everything in system RAM” approach against transfers between video/GPU RAM and system RAM over PCIe or another memory controller.

In the case of video games, you’ll see software developers trying hard to keep texture memory requirements within the dedicated “cache” of the asymmetric multi-processor…the video card…as soon as one starts getting a “cache miss” because the texture memory cannot all be held in the video card, performance dies and memory must be swapped. Now imagine that this same video card serves video processing for hundreds of video programs at the same time, and you start seeing the issues of SMP…everything uses that same memory, and the size is finite. Any general-use computer, able to run programs the hardware vendor cannot know about in advance, becomes more useful by providing CPU cores which are generally available for end users to do with as they want. Once a core becomes part of SMP, it’s up to a combination of the scheduler and how each program is designed to make use of those cores. You can tweak your own program; generally you cannot tweak the other programs running on those cores; and the hardware vendor can tweak neither…the vendor can add more cache and tune the scheduler, but nothing else.
CPU core 0 is a special case, as it is a hybrid of AMP and SMP. I say this because software interrupts can run on any core, but hardware interrupts can run only on core 0. This means devices connected to the system which generate an IRQ, whether integrated on the SoC or on the surrounding circuit board, depend on core 0. On an Intel-format desktop machine there is a device to balance hardware interrupts across all CPU cores…the SMP I/O APIC (Advanced Programmable Interrupt Controller). ARM was not designed to use this; it seems that none of the small devices intended for use in smart phones, tablets, or similar hardware have that ability. AMD’s Opteron CPUs have memory and other architectural differences which allow balancing of hardware IRQ loads across all CPU cores without an I/O APIC (which has its own tradeoffs). So if you do something to increase the load on core 0, it is possible you will also decrease the performance of networking, USB, video, and so on. That core will certainly have more pressure to evict cache and start over when forced to handle more load.
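You can see how hardware interrupts land on cores by reading `/proc/interrupts`. A sketch…the IRQ number 130 below is hypothetical, and on hardware without a balancing interrupt controller the affinity write may simply be rejected or ignored:

```shell
# Columns show per-core counts of each hardware IRQ serviced; on boards
# without interrupt balancing, nearly everything lands in the CPU0 column:
cat /proc/interrupts

# Attempt to steer a hypothetical IRQ (130) to core 1 (bitmask 0x2).
# If the hardware cannot route that IRQ elsewhere, this has no effect:
echo 2 | sudo tee /proc/irq/130/smp_affinity
```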
The scheduler gives you some ability to deal with this. The scheduler makes some attempt to keep a process or thread on a core where it thinks there will be a cache hit, and it may use the first available core if it does not know of a performance reason to wait (manually setting affinity is one such reason). In the JTX1 ARM CPU, the cache itself became more sophisticated compared to prior ARM CPUs, so this helps even when the scheduler does not know what to do (this is a fairly significant improvement, too). You can give process IDs a higher or lower priority using the “nice” or “renice” commands as root (or via sudo). If you assign a high priority in general, there may be times when this interferes with the scheduler servicing hardware IRQs on core 0…but if your threads were assigned only to cores 1 through 3, a higher priority would at least give your application lower latency when it needs a core to service it (other software on those cores would pay the penalty).
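Combining the two knobs might look like the following sketch (PID 4242 is hypothetical):

```shell
# Raise the priority of a hypothetical PID 4242
# (negative niceness values require root):
sudo renice -n -5 -p 4242

# Keep that process off core 0 so hardware IRQ servicing is not crowded out:
taskset -p -c 1-3 4242
```

The combination means your process competes aggressively on cores 1 through 3 while leaving core 0 free to service hardware interrupts.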