Fun with `top`: CPU "scaling" for power-saving?

Interesting effects can be observed with ‘top’, the process and memory-usage monitor, run from the command line (perhaps in an xterm).

Running the FlightGear (FlightGear.org) flight simulator, which makes use of OpenSceneGraph and the NVIDIA OpenGL driver, will reliably bring the system load up to 3 or more.

Using ‘top’ with a 1-second update interval in an xterm, and having pressed “1” to show per-CPU statistics, I notice that when only the xterm is running under compiz/lightdm/Xorg, the load is less than one and statistics are shown only for CPU#0.

But as the load rises much above 1, statistics also appear for CPU#1. As the load passes 2.0 or so, statistics may appear for CPU#2 and then for CPU#3 as well. Thus all four CPUs do get engaged in processing, but apparently only at high loads.
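To corroborate what top shows, the online state of each core can also be read from sysfs. A minimal sketch, assuming the standard Linux CPU-hotplug paths are exposed by the L4T kernel:

cat /sys/devices/system/cpu/online        # e.g. "0" when only CPU#0 is up, "0-3" when all are up
cat /sys/devices/system/cpu/cpu1/online   # 1 = online, 0 = offline (cpu0 may not expose this file)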

For the Kernel Geeks:

I assume this is inherent in the stock kernel, and it seems reasonable that it happens because the Tegra is really meant to be a mobile chip. However, on the Jetson TK1 board, which runs from the mains rather than from a battery, this may not be the most desirable default behavior.

I’m not a strong C coder and have not dug into this issue in the kernel source (maybe I am lazy). But perhaps those concentrating on kernel tweaking will know, or can find out, whether the CPU scaling can be altered to keep all 4 CPUs running full time. The goal would be an always-available layer of parallel processing for sorting out which work should be sent to CUDA, which is better handled sequentially, and which can be handled across the multiple CPU cores without the overhead of offloading it to the CUDA system. (Possibly I am not saying this well, for which I apologize.)

More or less: is it possible to run all 4 CPUs full time when connected to the mains, rather than always operating in a power-saving mode better suited to battery operation? Heat problems, maybe?

You can change it with something like the cpufreq utilities; set the governor to performance and that should run all 4 cores at all times. I haven’t seen any issues with running them for long periods of time myself w.r.t. heat - I’ve built a Gentoo rootfs in a chroot on there, and it took roughly 203 minutes of straight compiling to rebuild everything in @system (which is everything that a base Gentoo install needs).

hth

The system load is (roughly) the number of threads in a runnable state; on Linux it also counts threads blocked in uninterruptible sleep. I don’t know all the details, but when using a slow disk, for example, all processes may be waiting on the disk and the load goes up even though they are not actually using the CPU.

If you do have 4 threads or processes that are all actively using the CPU, the CPU cores should come up in milliseconds (they probably go down more slowly).
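If you want to see this without running FlightGear, one quick way (a sketch, not Jetson-specific) is to start a few busy loops and watch the per-CPU lines appear in top:

for i in 1 2 3 4; do yes > /dev/null & done   # four CPU-bound processes, one core's worth of load each
killall yes                                   # clean up afterwards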

You should be able to list all available governors with:

cpufreq-info -g

And as stated by steev above, you should be able to set the governor to performance (might need sudo):

cpufreq-set -g performance

Those are generic instructions but I think they should work also on Jetson.

You can also try the echo commands in the wiki:

http://elinux.org/Jetson/Jetson_TK1_Power#Maximizing_CPU_performance
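From memory, those echo commands amount to disabling the Tegra auto-hotplug (cpuquiet) and forcing the cores online; something along these lines, run as root (check the wiki for the exact paths on your L4T release):

echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable   # stop automatic core offlining
echo 1 > /sys/devices/system/cpu/cpu1/online                      # force the secondary cores online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor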

@Both of you: many thanks!

I installed “cpufreq-utils” and have tried things such as cpufreq-set -g performance, both in general and on a per-CPU basis.

(I should mention that I tried installing the “cpufreqd” package and it doesn’t work right; despite significant configuration modification and so forth, it locks the whole system to the lowest possible speed of 51 MHz, which might be useful on a spacecraft running on solar power out beyond the orbit of Neptune but isn’t helpful to me. Trying to set the variable either in the configuration files or manually results in an error about the socket not being found.)

When any specific core isn’t currently online, I get the following error message from cpufreq-set:

cpufreq-set -g performance -c2
Error setting new values. Common errors:
- Do you have proper administration rights? (super-user?)
- Is the governor you requested available and modprobed?
- Trying to set an invalid policy?
- Trying to set a specific frequency, but userspace governor is not available,
  for example because of hardware which cannot be set to a specific frequency
  or because the userspace governor isn't loaded?

Clearly, the system still brings specific cores up on demand, though when they do come up, each newly activated core seems to inherit the settings from CPU0; they all report the “performance” governor and all show 2.32 GHz as the speed.
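One way to confirm that the freshly onlined cores really do pick up the CPU0 policy is to dump the governor and frequency for every core once they are up. A sketch using the same cpufrequtils tools and standard sysfs paths:

for c in 0 1 2 3; do cpufreq-info -c $c -p; done            # policy (min/max/governor) per core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # governor per online core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq   # current frequency in kHz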

This is likely to be a gaming issue going forward, and it should be understood so that fixes or workarounds can be provided.

An example: this doesn’t seem to affect the GPU or CUDA or anything about the display system. In the FlightGear flight simulator, once loaded, the graphics are impeccable and extremely photorealistic. With any given number of cores running, motion is generally smooth and lifelike. However, when core usage changes, for example on approach to a landing, every time a new core starts up there is a stall of at least half a second and up to several seconds. While the scene rendering is still lovely and highly detailed, the frame rate is not helpful to the realism of the simulation.

I know that this can work well, because a low, flat approach in simulation gives enough time for all cores to get working and for the full level of detail to be loaded for the scene and sent to the graphics system. The result looks about like a well-rendered production television animation. The bottleneck seems to be that the working core(s) become overtaxed, another core starts up, and the hand-off in the division of labor causes pauses.

@kulve: this might be less noticeable, or less of a problem, if I were using an internal SATA drive plugged into the board’s port. Some of this may in fact be due to USB congestion. However, watching top show the various CPUs turn on and off leads me to believe that it may be as much due to core startup/powerdown issues as to USB or drive-latency issues.

Regards,

One thing which is rarely obvious (and is architecture-dependent) is that sometimes “CPU affinity” intentionally favors fewer CPU cores for performance reasons. I have no idea whether this is tied to power savings as well, but cache memory which already holds the correct content is far less expensive than flushing cache and going to main memory to find what is needed. It’s rather common to stick to a single CPU to avoid cache misses; alternating CPU cores when the cache on a single core is hitting would almost certainly hurt performance even with more cores working. Possibly, as the load goes up, more independent threads are being executed, so cache hit/miss is not an issue for the new threads. Even if a single CPU core is all that shows up under low load, you would have to understand what cache and CPU affinity are doing to know whether it is simply power saving or happens to be the best choice for performance.
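As a concrete illustration of affinity (nothing Tegra-specific), the taskset utility from util-linux lets you pin a process to a core or inspect its affinity mask; the program name and PID below are just placeholders:

taskset -c 0 ./my_program     # start a (hypothetical) program restricted to core 0
taskset -p 1234               # show the affinity mask of PID 1234
taskset -p -c 0-3 1234        # allow PID 1234 to run on all four cores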

So perhaps nVidia could answer the questions: Was there any kind of intentional cpu core affinity for performance reasons? Was there any kind of intentional cpu core affinity for power saving reasons?

@linuxdev: Thank you for the detailed explanation. It certainly provides the right words to the picture I was forming.

“CPU affinity” would presuppose that there is no external cache, as on early x86 systems (IIRC); each core would have its own internal cache, and the contents of that cache probably could not be easily passed over to the other cores. Alternatively, there might be an external cache, but one that isn’t “segmented” so as to be read from, much less written to, by later cores in a core-specific manner. If such read/write access were possible, it might be possible to populate each cache with data by initially copying (and updating) the contents of the CPU0 segment of the external cache to the segments reserved for the other cores. This constant copying and updating would definitely cause a performance hit… except at the exact moment the other cores came up and found up-to-date data ready for processing. At least their startup would be fairly seamless.

I should probably refrain from wild speculation here or from writing science-fiction not actually based in science. ;-; Although it would be interesting and desirable for high-performance rack-mount servers, it’s unreasonable to expect any such architecture on what is essentially a mobile chip stuck onto a mid-range graphics card. Yet as such small/embeddable systems become the mainstream, it might make sense for future multi-core board architectures to consider going back to the future with external cache.

And yes, NVidia, we would love more answers from you.

You may find the ARM architecture references useful; much detailed cache information is available there. nVidia refers to this web site in regard to the JTAG pin-out, but it also has detailed information on the cache:
https://silver.arm.com/

I do not include any specific document, as you have to create a login there. The ARM Cortex-A15 is an ARMv7-A architecture, so that is where I’d suggest starting. If you look at /proc/cpuinfo, you will see the A15 reported as an ARMv7 processor, rev 3 (v7l).
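For what it’s worth, the relevant fields can be pulled straight out of /proc/cpuinfo (field names as commonly seen on ARM kernels; output may vary between L4T releases):

grep -E 'model name|CPU architecture|CPU part' /proc/cpuinfo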

I should mention that I seem to have significant improvement using a package called “tlp”.

sudo tlp ac start

helps quite a bit, I think.
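If you want to see which processor settings tlp actually applied, the tlp-stat tool that ships with it should show them; I believe the -p switch selects the processor/frequency section:

sudo tlp-stat -p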

Thanks for the refs!