Suggestions for reducing idle power consumption?

At the moment, my Sparks are consuming ~40W of power each at idle. That seems high for an ARM-based platform (for comparison, I’m currently running 4 ARM-based RK1 modules at 14W idle). Obviously, the Blackwell chip needs power too, but I’m wondering if there are any system parameters I can tune to reduce the idle power draw, while still keeping the system performant under load.

Of course, I can (and do) power them off when not in use, but I’d prefer a “soft” fix.

What are you using to measure power consumption?

Anyway I did my own hypothetical reductions based on prior knowledge.

The first thing I did was look at the CPU frequencies, and I found that they basically never drop to the minimum frequency. You can fix that (as root) with
printf 'w /sys/devices/system/cpu/cpu*/acpi_cppc/energy_perf - - - - 1\n' > /etc/tmpfiles.d/efficiency.conf

or

echo 1 | sudo tee /sys/devices/system/cpu/cpu*/acpi_cppc/energy_perf

Values range from 0 to 255, going from performance to power efficiency, and the default is 0. You can play around with the values. I did not record data rigorously, but from quick 2-4 second tests, 0 is equivalent to max frequency (performance) and 255 hard-locks the cores to min frequency.

From my playing around, moving from 255 toward 0 generally ramps the performance cores up first, so don’t worry too much about it in the context of idle watts and just set it to 1.
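
To confirm that a write actually stuck, it helps to read the value back per CPU. This is only a rough sketch: the sysfs path is the one quoted above, while the function name show_energy_perf and its optional root-directory argument are made up for illustration. Whether the file exists at all depends on the kernel build.

```shell
#!/usr/bin/env bash
# Print each CPU's energy_perf value (0 = performance ... 255 = efficiency,
# per the description above). The path comes from this thread; the function
# name and the optional root argument are illustrative assumptions.
show_energy_perf() {
    local root="${1:-/sys/devices/system/cpu}"
    local f cpu
    for f in "$root"/cpu[0-9]*/acpi_cppc/energy_perf; do
        [ -r "$f" ] || continue      # skip if the kernel does not expose it
        cpu="${f#"$root"/}"          # strip the root prefix
        printf '%s %s\n' "${cpu%%/*}" "$(cat "$f")"
    done
}

show_energy_perf
```

If this prints nothing, the kernel is not exposing the file and the tee command above has nothing to write to.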

I did some other things that affect average performance vs. power efficiency, but for idle power consumption it should be equivalent to just the above.

Nominally this would be the same as the powersave scaling governor, but that governor does nothing on my end. I spent a couple of hours digging around the web to find that sysfs entry.

I’ve got my Sparks connected to a smart power strip (HS300), then I’m pulling power readings per plug into Prometheus every 15 seconds. Probably not 100% accurate, but good enough for relative measurements and not bad for $40!

Hmm, this didn’t seem to have any effect. I tried both 1 and 255 and didn’t see any significant change in power draw (the “big” spike is from my SSH session and poking around).

Oh, that’s rough. I thought it was the CPU not idling, since max GPU watts through nvidia-smi was ~50-60, which would make the rest of the package ~150W.

I guess that 40W is coming from elsewhere, then. It might be worth running sudo powertop --auto-tune just to see if that shaves anything off. Maybe the ConnectX-7 port is the offender?

Looks like it might be 5W+ if this is anything to go by: “Connectx-4 Lx 2x 25GbE with ASPM support + idle power consumption measurements” (Workstations & Servers, Level1Techs Forums).

powertop showed a few tunables that were “bad”, but --auto-tune didn’t have much of an effect on the power draw, unfortunately.

Bad VM writeback timeout
Bad Runtime PM for PCI Device Mellanox Technologies MT2910 Family [ConnectX-7]
Bad Runtime PM for PCI Device Realtek Semiconductor Co., Ltd. Device 8127
Bad Runtime PM for PCI Device Mellanox Technologies MT2910 Family [ConnectX-7]
Bad Runtime PM for PCI Device MEDIATEK Corp. Device 7925
Bad Runtime PM for PCI Device Mellanox Technologies MT2910 Family [ConnectX-7]
Bad Runtime PM for PCI Device Samsung Electronics Co Ltd Device a810
Bad Runtime PM for PCI Device Mellanox Technologies MT2910 Family [ConnectX-7]
Good NMI watchdog should be turned off
Good Bluetooth device interface status
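
The “Bad Runtime PM” rows correspond to PCI devices whose power/control file in sysfs reads "on" rather than "auto". A minimal sketch for listing those devices by hand, assuming the standard /sys/bus/pci/devices layout (the function name and the optional root argument are just for illustration):

```shell
#!/usr/bin/env bash
# List PCI devices whose runtime power management is disabled
# (power/control reads "on"; "auto" lets the kernel suspend the device).
# Function name and the optional root argument are illustrative assumptions.
list_runtime_pm_off() {
    local root="${1:-/sys/bus/pci/devices}"
    local dev
    for dev in "$root"/*; do
        [ -r "$dev/power/control" ] || continue
        if [ "$(cat "$dev/power/control")" = "on" ]; then
            printf '%s\n' "${dev##*/}"
        fi
    done
}

list_runtime_pm_off
# To enable runtime PM for a single device (needs root), something like:
#   echo auto | sudo tee /sys/bus/pci/devices/<address>/power/control
```

This is roughly what --auto-tune toggles for those rows, so it is mainly useful for checking whether the setting survived a reboot.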

I also remembered that wifi and bluetooth were enabled out of the box, so turned both of those off with both rfkill and nmcli radio all off. Not much of an effect on power there, either, unfortunately.

What DID have an effect was disconnecting the direct-attach cable from the ConnectX-7. That dropped power usage by nearly 4W, but it’s still idling at 36W or so.

No more ideas on reducing idle watts while being responsive.

For reducing idle watts while sacrificing the responsiveness, maybe rtcwake with modes standby/freeze might be better.
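
Before trying rtcwake, it may be worth checking which sleep states the kernel actually advertises. A rough sketch, assuming the usual /sys/power/state file; the function name and the one-hour duration are arbitrary choices for illustration, and the command is only echoed here rather than run:

```shell
#!/usr/bin/env bash
# Check whether a sleep state is advertised by the kernel before calling rtcwake.
# /sys/power/state lists the supported states, e.g. "freeze mem disk".
# Function name and the optional file argument are illustrative assumptions.
sleep_state_supported() {
    local want="$1" file="${2:-/sys/power/state}"
    local s
    [ -r "$file" ] || return 1
    for s in $(cat "$file"); do
        [ "$s" = "$want" ] && return 0
    done
    return 1
}

if sleep_state_supported freeze; then
    # Suspend-to-idle for one hour, then wake via the RTC alarm (needs root).
    echo "sudo rtcwake -m freeze -s 3600"
fi
```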

For keeping the system “performant under load”, depending on how much performance you want for a given load, you can run everything (or as much as can be reassigned automatically) on the efficiency cores by default and make the performance cores “opt-in”, which is what I do.

# run as root
EFF=0-4,10-14    # efficiency cores
PERF=5-9,15-19   # performance cores
# append nohz_full/rcu_nocbs for the performance cores to the kernel command line
sed -i "s@\(GRUB_CMDLINE_LINUX_DEFAULT=.*\)\"@\1 nohz_full=$PERF rcu_nocbs=$PERF\"@" /etc/default/grub
# check the result before regenerating the GRUB config
grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
update-grub
# on systemd-enabled systems this generally works fine (cgroups v2)
mkdir -p /etc/systemd/system/{system,user}.slice.d
# restrict both slices to the efficiency cores by default
printf "[Slice]\nAllowedCPUs=%s\n" "$EFF" > /etc/systemd/system/user.slice.d/99-efficiency.conf
printf "[Slice]\nAllowedCPUs=%s\n" "$EFF" > /etc/systemd/system/system.slice.d/99-efficiency.conf
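
Since a typo in either CPU list ends up on the kernel command line, a quick sanity check that the two lists do not overlap and together cover every core can save a reboot. A small sketch; expand_cpulist is a made-up helper name, and the lists are the ones used above:

```shell
#!/usr/bin/env bash
# Expand a kernel-style CPU list ("0-4,10-14") into one CPU number per line.
# The helper name is an illustrative assumption.
expand_cpulist() {
    local part lo hi
    local IFS=','
    for part in $1; do
        lo="${part%-*}"    # text before the dash (or the whole entry)
        hi="${part#*-}"    # text after the dash (or the whole entry)
        seq "$lo" "$hi"
    done
}

EFF=0-4,10-14
PERF=5-9,15-19

# CPUs that appear in both lists (should print nothing).
{ expand_cpulist "$EFF"; expand_cpulist "$PERF"; } | sort -n | uniq -d

# Number of distinct CPUs covered by the two lists; compare with `nproc --all`.
{ expand_cpulist "$EFF"; expand_cpulist "$PERF"; } | sort -un | wc -l
```

After the reboot, cat /sys/devices/system/cpu/nohz_full and systemctl show -p AllowedCPUs system.slice should reflect the same lists.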

This doesn’t reduce idle watts; it just allows you to be more “efficient” depending on your needs.

stress -c 20 should then only trigger your efficiency cores

sudo systemd-run --scope --slice=generic.slice taskset -c 5-9,15-19 stress -c 20

should trigger the performance cores only

sudo systemd-run --scope --slice=generic.slice sudo -u $USER stress -c 20

should trigger all cores

Unfortunately this requires root privileges. Doing it without root should theoretically be possible, but the setup would be far more involved: off the top of my head, the easiest way would need a startup script per user every time, whereas this way you can just set and forget. Include sudo -u $USER to make life easier, especially for file outputs.

I couldn’t find any watt figures for performance vs. efficiency cores, but if Intel is anything to go by, performance cores use a lot of watts.

At this point I think it may just be easier to power them down when they’re not in use. Easy enough to automate with the smart powerstrip. Unless someone from Nvidia chimes in with some additional tuning for the idle draw. I definitely don’t want to be plugging/un-plugging the interconnect all the time. 😂

Thanks for all the suggestions!

I picked up a Kill-a-Watt on the way home, so I can confirm, with a sample size of one, that the performance cores are hella inefficient.

Assuming ~200W total power and 0W CPU power at idle, it should be roughly divided into:

~50-60W GPU

~100W Performance cores

~10W Efficiency cores (this is pretty small but that’s the increase of watts I’m getting with stress -c 20 on the 10 cores).

~50W SoC or whatever is getting the rest

You can probably get sub-100W running power if you isolate the performance cores and still run the GPU flat out.

I don’t really know if the performance cores are worth it if they are responsible for half the watts.

EDIT:

The GPU is power limited (I believe) to around 100W. It throttles hard, down to the ~50W in the breakdown above, if you get the CPU pumping.

I have not gotten the GPU to 100W in ML/AI inference, only from running some demo projects from Unreal Engine.

In my own testing, I found the idle power floor to be around 31.5W, with nothing at all connected (besides a tiny USB keyboard dongle), but WiFi enabled.

30W idle seems high for this system, but I think it’s an artifact of moving a server architecture into a mini PC form factor. It reminds me of the Cix CP8180, with 12 cores, idling at 14W. Whatever magic sauce Apple (and to some extent, Qualcomm) has for lowering the idle power draw on their minis is not present in other higher-end Arm desktop designs!

(Power measured via ThirdReality Zigbee Smart Outlet, which has been very reliable in my testing)

You can try to compile a 6.17 kernel and use the schedutil CPU governor. The default governor is performance, so all the cores run at higher frequencies when idle. The kernel that comes with DGX Spark (even after the 6.14 update) doesn’t include schedutil though, only powersave and performance.

When I was running a custom 6.17 kernel with the schedutil governor, I saw CPU cores going down to 300 MHz or so.

You can try powersave first and see if it makes a difference.
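
For anyone trying this, a rough sketch of how to see which governors the running kernel offers and what each policy is currently doing, assuming the standard cpufreq sysfs layout (the function name and the optional root argument are illustrative only):

```shell
#!/usr/bin/env bash
# Show the current governor, current frequency (kHz) and available governors
# for every cpufreq policy. Function name and the optional root argument are
# illustrative assumptions; the file names are the standard cpufreq ones.
show_governors() {
    local root="${1:-/sys/devices/system/cpu/cpufreq}"
    local p
    for p in "$root"/policy*; do
        [ -r "$p/scaling_governor" ] || continue
        printf '%s governor=%s cur_khz=%s available=[%s]\n' \
            "${p##*/}" \
            "$(cat "$p/scaling_governor")" \
            "$(cat "$p/scaling_cur_freq" 2>/dev/null)" \
            "$(cat "$p/scaling_available_governors" 2>/dev/null)"
    done
}

show_governors
# To switch every core to powersave (needs root):
#   echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Running it before and after the switch shows whether the frequencies actually drop at idle.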

Looks like it was the ConnectX-7. We’re saving almost 20W.

A nice reduction on idle power. Could someone share if average power is reduced when performing an inference job? Perhaps using gpt-oss:120b.

For the Ascent GX10 specifically: the new firmware required for the power reduction (to apply on top of the OS update) is in lvfs-testing.

With the customary “this is pre-release” warning: sudo fwupdmgr enable-remote lvfs-testing followed by sudo fwupdmgr update.

Also gives 2GB more RAM to the OS.

Thanks!! Can confirm on a single node. I’m measuring the power directly at the socket. Max. 24 W idle with this firmware upgrade, previously it was around 40 watts.

Confirmed here as well. From 40W at idle to ~25W. One thing I did notice that might help others is that I had to physically remove the direct-connect cable from the ConnectX-7 NIC to trigger the power savings. It was not sufficient to just have the other node be powered off.