Nvpmodel config

In general you don’t want those cores reserved. Unless you have a specific purpose for doing so, I’d remove the “isolcpus=1-2” kernel argument. Generally speaking, this won’t stop anything from working either way; it just changes performance.
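As a sketch of where that argument typically lives: on many Jetson releases the kernel command line is assembled from the `APPEND` line in `/boot/extlinux/extlinux.conf` (the exact file and the other arguments shown below vary by release and carrier board, so treat this as illustrative and check your own system). Removing `isolcpus=1-2` from that line and rebooting returns the cores to the scheduler:

```
# /boot/extlinux/extlinux.conf (typical location; verify on your release)
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      # Before: APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait isolcpus=1-2
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait
```

After a reboot, `cat /proc/cmdline` will confirm whether `isolcpus` is still being passed to the kernel.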

It isn’t your original question, but some background related to all this might be worth explaining. This goes beyond a simple answer, but instead of just saying to use or not use isolcpus, you might find the reasoning interesting (ignore this if you don’t need to know why cores might be isolated).

First, the Denver cores do much the same as the other cores, but they have different characteristics. I think one reason they were added on the early models is their energy efficiency and flexibility. Overall, though, they are not that different from any other core.

So the question becomes which core to use and when. This takes you down the proverbial rabbit hole of schedulers and of what multitasking really is. Any time you have more threads or processes than compute resources, something has to wait while something else runs. Determining what runs and when is something end users rarely think about, but the scheduler does this constantly based on a set of rules.

A typical rule is that every process (I’ll not mention threads again, but threads are mostly interchangeable with processes in what follows) is given a time slice. It runs for that time, and then another process runs while the original is safely stored away, ready to resume where it left off. It takes time and memory to stop and store a process, so it is best not to wildly switch processes before significant work has been completed. Also, some parts of a process might need to be atomic, and interrupting them mid-operation is not practical; an example might be a process retrieving data from a hard drive…the drive is not going to stop, switch to reading some other sector, and go back without error, since it only reads in whole blocks.

The scheduler is aware of atomic sections, and it also has rules for batching certain things together, or in a row, for efficiency. An example of batching: multiple threads of a single process might share the same memory, so running them consecutively gains a benefit by not invalidating the cache the way a switch to an entirely unrelated process would. The scheduler modifies schedules of execution based on concepts such as these.
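You can peek at this machinery from userspace. As a minimal sketch (Linux only, using Python’s standard-library wrappers around the scheduler syscalls), every process carries a scheduling class, or “policy”, that the kernel consults when handing out time slices; an ordinary process runs under the default time-sliced fair policy, `SCHED_OTHER`:

```python
import os

# Every process is marked with a scheduling class ("policy") that the
# kernel's scheduler consults when handing out time slices. An ordinary
# Linux process runs under SCHED_OTHER, the default fair policy.
policy = os.sched_getscheduler(0)  # 0 means "this process"
names = {os.SCHED_OTHER: "SCHED_OTHER",
         os.SCHED_FIFO: "SCHED_FIFO",
         os.SCHED_RR: "SCHED_RR"}
print("current policy:", names.get(policy, policy))
```

On a desktop or Jetson shell this will normally report `SCHED_OTHER`; the realtime classes show up later in this post.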

The Denver cores might have increased latency in some respects, yet they have benefits in other ways (such as low power consumption in low power models) for a given job. Avoiding those cores for general computing can lower latency. However, if you were to run some particular process on those cores, you might find better power consumption and the same average performance despite the higher latency. Thus it could be of interest to use those cores specifically when you don’t care about increased latency but do want decreased power. Plus they are extra cores, so you could do more work overall. You might not, however, use such a core for something like handling your mouse as a gamer.

Jetsons tend to be used for CUDA or other GPU-related specialty software. An RPi is fine for many things, but a Jetson user might often be interested in putting their specialty software on the Denver cores while using the others for “regular” use. The scheduler has no way to know about this on its own, but if you isolate those cores from the scheduler (that is what “isolcpus” does as a kernel argument…it configures the scheduler’s general behavior), then those cores are reserved only for processes you’ve marked with a CPU affinity for those cores. You can of course drop isolcpus and just tell the scheduler to use the cores like any others. If you don’t have a special performance objective, then I suggest just enabling the cores, since it takes special per-process setup to use isolated cores; then you don’t need to do anything and the cores are simply used without effort. If you are benchmarking, though, consider that improvements in either latency or power consumption might be achieved by isolating cores and assigning CPU affinity to particular processes.
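Setting that affinity is a one-liner from userspace. A sketch, again with Python’s stdlib wrappers (the assumption that CPUs 1 and 2 are the Denver cores matches the default TX2 numbering implied by “isolcpus=1-2”, but verify against `/proc/cpuinfo` on your board):

```python
import os

# With "isolcpus=1-2" on the kernel command line, the scheduler never
# migrates ordinary work onto CPUs 1-2; pinning a process there via
# CPU affinity is how you claim the isolated cores.
print("default affinity:", os.sched_getaffinity(0))

wanted = {1, 2}                         # assumed Denver core numbers
available = set(range(os.cpu_count()))
if wanted <= available:                 # guard for smaller systems
    os.sched_setaffinity(0, wanted)     # pin this process to 1-2
print("current affinity:", os.sched_getaffinity(0))
```

The shell equivalent is `taskset -c 1-2 ./my_app` to launch a program on those cores, or `taskset -cp <pid>` to inspect or change a running process.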

The following goes even further from your question: more detail on latency and on classifications of schedulers. Before that I’ll state that in any system, competition of threads or processes for compute resources might be described by “how nice the scheduler is to one process versus another”. The nicer a process’s schedule is, the less that process gets when the competition needs time. One can be “less nice” to become king of the hill. Just be careful that something critical (e.g., disk drivers) doesn’t get a lower priority than something that is atomic and won’t complete without disk data (disk drivers shouldn’t be too nice). It starts to remind me of time travel science fiction movies where there are unintended consequences when something in the past conflicts with something in the future, creating a paradox (in this case, a priority inversion might occur).
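“Niceness” is literally the knob here. A quick sketch of reading and raising it (an unprivileged process may only make itself *more* nice, never less):

```python
import os

# os.nice(increment) adjusts this process's niceness and returns the
# new value; higher means the scheduler is "nicer" to everyone else,
# i.e. this process yields more readily under competition.
before = os.nice(0)   # an increment of 0 just reads the current value
after = os.nice(5)    # be nicer: lower our effective priority
print(before, "->", after)
```

The shell equivalents are `nice -n 5 ./my_app` at launch and `renice` for a running process; going below 0 (being “less nice”, king of the hill) requires root.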

There is ordinary scheduling, the plain vanilla scheduler you are used to on a Linux desktop PC and what most of the Jetson works with for ordinary software. Then there are soft realtime extensions. Most processes are marked with a default class, which is ordinary (a “class” is something a process is marked with, and the scheduler enforces a policy for that class). If you have soft realtime extensions, then processes can be marked for priority, and although the ordinary fairness of competition between processes still occurs, those of the realtime class are given some minimal time slice no matter how unfair it is to the other processes. The soft realtime scheduling you’ll see a lot on a PC is audio playback. Audio really must get a certain time slice in order to prevent stuttering; simple batch playback of audio is unacceptable.
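Marking a process with a realtime class looks like the following sketch. Note the realtime classes are privileged: without root (or `CAP_SYS_NICE` / a suitable `RLIMIT_RTPRIO`) the kernel refuses, which is exactly the gatekeeping that protects ordinary processes from being starved by a runaway realtime one:

```python
import os

# Attempt to move this process into the soft-realtime SCHED_FIFO class
# at a modest priority. On an ordinary user shell this raises
# PermissionError -- itself a useful demonstration that realtime
# classes are gated behind elevated privileges.
try:
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(10))
    print("now realtime: SCHED_FIFO, priority 10")
except PermissionError:
    print("refused: realtime classes need elevated privileges")
```

The shell equivalent is `chrt -f 10 ./my_app` (or `chrt -p` to inspect a running process).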

However, what would happen if some critical atomic section of an important driver were preempted for audio? You might see the system crash, data corruption, and so on. This is why that particular realtime is “soft” realtime: it isn’t guaranteed. In order to get hard realtime (and ARM actually has several subclassifications of hard realtime), where priorities are absolutely guaranteed for given time slices, you need hardware assistance. You probably need hardware support and smart programming to deal with some atomic code sections and scheduling algorithms (resource requirements for realtime scheduling go up exponentially with the number of threads, and the ARM Cortex-R series CPUs tend to have actual scheduling hardware which doesn’t exist on Cortex-A and Cortex-M…I really like Cortex-R). Also, although cache RAM improves average performance, its hit-or-miss nature makes it bad for hard realtime, because hard realtime is really about enforced absolute latency, not about average throughput. Hard realtime means lower performance, but an absolute guarantee that you get what you want without failure, ever.

The RT kernel extensions you can get with the ARM Cortex-A (which is what most of a Jetson is, although it does have some Cortex-R you can’t normally reach) are soft realtime. The presence of the Denver cores can be used to tune average performance to higher levels without consuming as much power as the regular cores, but latency might be worse. Whether that latency matters depends on your use, but most users won’t be aware of it. Someone performing benchmarks will notice. If latency becomes an issue under load, then this is when you consider working with CPU affinity and keeping the Denver cores isolated but scheduled for specific classes of processes (which you’d have to set up manually).


Just to reiterate, isolating a core and then setting a process’s affinity to that core is one step for tuning performance. This in itself does not have a realtime effect, other than the fact that no other process is competing (which is a rather large performance advantage, especially since the cache will never be invalidated by another process). A process that needs a lot of average performance without consuming as much power, and which tolerates some latency increase, would be a good candidate for Denver core affinity (and in many cases you wouldn’t care and could just schedule the Denver cores like any others). Realtime itself implies setting classes on processes and informing the scheduler to use non-default priority algorithms.

Note that isolating cores does not disable them; it just tells the scheduler not to send anything there that is not specifically marked for those cores.
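You can confirm which cores are currently isolated without digging through boot configs. A small sketch reading the standard sysfs file (present on modern kernels; the guard covers older ones that don’t expose it):

```python
# Inspect which CPUs the running kernel has isolated via "isolcpus".
# An empty file means no cores are isolated from the scheduler.
try:
    with open("/sys/devices/system/cpu/isolated") as f:
        isolated = f.read().strip()
    print("isolated cpus:", isolated or "(none)")
except FileNotFoundError:
    print("this kernel does not expose isolated-CPU information")
```

With “isolcpus=1-2” in effect this prints `isolated cpus: 1-2`; after removing the argument and rebooting it prints `(none)`.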
