Pthread CPU Affinity on Jetson TX1

I am developing a multi-threaded application for the Jetson TX1 platform. I am using pthreads for this.

I have a total of 4 threads running. What I have observed is that if I don’t set the CPU affinity for each thread, then the Jetson’s scheduler schedules each thread on a different CPU, varying from 0 to 3 (I am printing the current CPU ID using the API sched_getcpu()).

Now, if I set CPU0, CPU1, CPU2, and CPU3 for each one of the four threads using the API pthread_attr_setaffinity_np, then the affinity is set for each thread; I confirmed this by printing the CPU ID in each thread…
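
For reference, a minimal sketch of this kind of per-thread pinning (the worker body and thread count here are placeholders, not the actual application code):

```c
/*
 * Sketch only: pin thread i to CPU i and print where it actually runs.
 * The worker body is a placeholder, not the original video-processing code.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++) {
        pthread_attr_t attr;
        cpu_set_t cpus;

        pthread_attr_init(&attr);
        CPU_ZERO(&cpus);
        CPU_SET(i, &cpus);                 /* thread i -> CPU i */
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

        pthread_create(&threads[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
```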

I profiled my application in both cases, and I found that the end-to-end latency per video frame is lower in the first case (no affinity set), so I think the default scheduler is placing each pthread better than my manual per-thread affinity does…

Any more insight is highly appreciated…

I do not have an answer for you, but I do have a related observation. Thread affinity used for performance reasons has much to do with whether the cache hits or misses…tying a thread to one CPU in theory means the cache can always be hit. However, if other processes on the system are serviced (and many are), then one of those other processes could have made the cache useless by the next time that thread runs. If that is the case, and if another CPU core were available, it would actually be faster to let the scheduler use an immediately available CPU (versus waiting for the assigned core).

FYI, only CPU0 services hardware interrupts, and so most of your drivers can run only there. It would be interesting to see what happens if you used affinity with only cores 1 through 3, leaving core 0 alone; or perhaps just used cores 2 and 3.
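
For example, something along these lines would keep a thread off core 0 while still letting the scheduler pick freely among cores 1 through 3 (a sketch only, assuming the four-core TX1 layout; error handling is minimal):

```c
/*
 * Sketch: keep the calling thread off core 0 (which services hardware IRQs)
 * but let the scheduler choose among cores 1-3. Assumes a 4-core TX1.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t cpus;
    int err;

    CPU_ZERO(&cpus);
    CPU_SET(1, &cpus);
    CPU_SET(2, &cpus);
    CPU_SET(3, &cpus);                  /* any of cores 1-3, never core 0 */

    err = pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));

    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}
```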

Yes, I understand your point.

I thought of setting the affinity of each thread to its own CPU to avoid CPU switching for each thread…
So doesn’t this CPU switching for each thread add some time?

The threads already exist, so creation time is not in question…only context switching. For cases where there is a cache miss, context switching won’t care which CPU it goes to; each core is basically equal. The registers and data which have to be saved as a thread goes out of context, followed by the registers and data to be loaded when going back into context, are fairly equal regardless of core. That cache, though, is a huge benefit when it can be preserved and used without re-loading. If a thread goes to a new core, it cannot reuse the cache it warmed on the old core; but if other processes are causing the cache to change, you will also get a cache miss.

OK, great, so it looks like it’s better not to assign CPU affinity on the JTX1 if we are concerned about performance…

What’s best tends to be a moving target, depending on the nature of your program and of everything else running.

Consider the case of a system with just one core, running just one process, but swapping out different threads. The cache has a limited size, and it is possible that each thread needs the same data, although you can only guarantee that some parts of the threads have data in common. As soon as one of your own threads requires data not in cache, expect the cache to be flushed as the required data is loaded (invalidating the cache for the previous thread…each thread thrashes the cache). Cache tends to be loaded in blocks or lines of addresses, so even the parts in common may get flushed and re-loaded between two threads of a single process. The nature of your own data use can change how cache hits work: e.g., if you run two threads accessing the same memory region (such as some shared memory operations) on the same core, you will get mostly cache hits (and if there is a miss, a single load satisfies both threads, halving the overhead)…if those two threads are assigned to separate cores and are changing the data, you are guaranteed that the cache is mostly a miss on both cores, with maximum overhead. So there is the question…are your own threads operating on different data? Are your own threads sharing data? This might mean two threads on one core is best, or two threads on separate cores is best.
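
To make the shared-data case concrete, here is an illustrative sketch that pins two threads to the same core so their accesses to a common buffer tend to stay in that core’s cache (the core number, buffer size, and trivial workload are arbitrary choices, not a benchmark):

```c
/*
 * Sketch: two threads pinned to the same core, walking the same buffer,
 * so the cache lines they touch can be shared rather than bounced.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define N 4096

static int shared[N];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *touch(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (int i = 0; i < N; i++)
        shared[i]++;                    /* both threads walk the same lines */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    cpu_set_t cpus;
    pthread_t a, b;

    pthread_attr_init(&attr);
    CPU_ZERO(&cpus);
    CPU_SET(1, &cpus);                  /* same core for both threads */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    pthread_create(&a, &attr, touch, NULL);
    pthread_create(&b, &attr, touch, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_attr_destroy(&attr);

    printf("shared[0] = %d\n", shared[0]);
    return 0;
}
```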

Consider asymmetric processing (“AMP”, versus the “SMP” symmetric multi-processing of the Jetson and multi-core desktops). In asymmetric models, different cores can be dedicated and not available to everyone. The old VAX machines did this. It may make perfect sense to dedicate a core to a single function, such as disk drive access when thousands of users may all need data and disk access is a bottleneck. Dedicating all of the cache and buffer in one core to one function is about as good as it gets for making use of cache. In a way, the rise of GPUs dedicated to video graphics does this…it is a specialty function where you know for certain that the function is a bottleneck and dedicated hardware will improve performance. Even so, a single dedicated core could still have cache misses if more data is required than the cache can hold…so add more cache. The ultimate AMP cache is the video card, where you might have 2GB or 4GB of very fast RAM.

More cache is better. Or maybe not, if power consumption or physical size or cost matters. Server CPUs cost more in many cases, use more power, and have more cache than a typical desktop system. Server CPUs consume more power at the same clock speed, and each further increase in clock speed requires disproportionately more power…if no cache were present the job would get done anyway, just slower…you could choose to reduce power, cost, heat, and size by not including cache.

Imagine if the Jetson had 2GB of dedicated RAM similar to a video card, in addition to system RAM…the power consumption would go off the charts. Code would need to be re-written to optimize the difference between the power-efficient “everything in system RAM” approach versus transfers between video/GPU RAM and system RAM over PCIe or another memory controller. In the case of video games, you’ll see software developers trying hard to keep texture memory requirements within the dedicated “cache” of the asymmetric multi-processor…the video card…as soon as one starts getting a “cache miss” because the texture memory cannot be held in the video card, performance dies, and memory must be swapped. Now imagine that this same video card serves video processing for hundreds of video programs at the same time, and you start seeing the issues of SMP…everything uses that same memory, and the size is finite.

A general-purpose computer runs programs the hardware vendor does not know about, so the vendor makes the device more useful by providing CPU cores which are generally available for end users to do with as they want. Once a core becomes part of SMP, it’s up to a combination of the scheduler and how each program is designed to make use of those cores. You can tweak your own program; generally you cannot tweak the other programs running on those cores; and the hardware vendor cannot tweak either of those…the vendor could add more cache and tweak the scheduler, but nothing else.

CPU core 0 is a special case, as it is a hybrid of AMP and SMP. I say this because software interrupts can run on any core, but hardware interrupts can run only on core 0. This means devices connected to the system which generate an IRQ, regardless of whether they are integrated on the SoC or on the surrounding circuit board, depend on core 0. On an Intel-format desktop machine, there is a device to balance hardware interrupts across all CPU cores…the SMP I/O APIC (advanced programmable interrupt controller). ARM was not designed to use this. It seems that none of the small devices intended for use in smartphones, tablets, or similar devices has that ability. AMD’s Opteron CPUs have memory and other architecture differences which allow balancing of hardware IRQ loads across all CPU cores without an I/O APIC (which has its own tradeoffs). So if you do something to increase the load on CPU 0, it is possible you will also decrease the performance of networking, USB, video, and so on. That core will certainly have more pressure to clear its cache and start over when forced to handle more load.

The scheduler gives you some ability to deal with this. There is some attempt by the scheduler to keep a process or thread on a core where it thinks there will be a cache hit. The scheduler may attempt to use the first available core if it does not know of a performance reason to wait (manually setting affinity is a reason to wait). In the JTX1 ARM CPU, the cache itself became more sophisticated compared to prior ARM CPUs, so this also helps even when the scheduler does not know what to do (this is a fairly significant improvement too). You can give process IDs a higher or lower priority using the “nice” or “renice” commands as root (or with sudo). If you were to assign a high priority in general, there may be times when this interferes with the scheduler doing what is required for hardware IRQ servicing on core 0…but if your threads were assigned only to cores 1 through 3, a higher priority would at least give your application lower latency when trying to get a core to service it (other software would suffer a penalty on those cores).
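
As a sketch of the priority side of this, the same thing can be done from inside the program with setpriority(); the nice value of -5 below is arbitrary, and negative values need root (or CAP_SYS_NICE), just as the shell commands do:

```c
/*
 * Minimal sketch: raise the priority of the calling process from inside the
 * program, roughly equivalent to `sudo renice -n -5 -p <pid>` from a shell.
 * The value -5 is arbitrary; negative nice values require root/CAP_SYS_NICE.
 */
#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (setpriority(PRIO_PROCESS, 0, -5) == -1)
        fprintf(stderr, "setpriority: %s\n", strerror(errno));
    else
        printf("nice value is now %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}
```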