GB10 GPU Power Stuck Around 37W When Running LLMs (Gemma 4 26B / Qwen 3.6 27B)

Hi,

I’ve been using a DGX Spark system for about a month. I’m still relatively new to this area, but I’ve been learning through hands-on experience.

I’m currently encountering a performance issue that I don’t fully understand.

When running LLMs, especially larger models like Gemma 4 26B and Qwen 3.6 27B, I consistently observe that the GB10 GPU power consumption stays at ~37W and does not scale higher under load.

Because of this, the inference speed is significantly slower than expected.

What I find confusing:

  • The GPU appears to be active, but power usage is very low

  • It never ramps up beyond ~37W even during sustained inference

  • This happens consistently across different models

I’m trying to determine whether:

  1. This is expected behavior for GB10 (power-limited by design?), or

  2. There is a configuration / software issue (e.g., CUDA, driver, vLLM, PyTorch) causing the GPU to not fully utilize its capacity (see the monitoring sketch below this list)
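
For reference, here is a minimal polling sketch for watching power, clocks, and utilization while inference runs. The use of the nvidia-ml-py package (pip install nvidia-ml-py) is just my choice of tool, not anything DGX Spark-specific, and I avoid the memory-usage queries that nvidia-smi already reports as Not Supported on GB10:

# poll_power.py - sample GPU power, SM clock, and utilization once per second.
# Assumes the nvidia-ml-py package; memory-usage queries are skipped because
# nvidia-smi reports them as Not Supported on GB10.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"power={power_w:5.1f}W  sm_clock={sm_clock}MHz  "
              f"gpu_util={util.gpu}%  mem_util={util.memory}%")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()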

If anyone has experience with DGX Spark or GB10, I would really appreciate your insights.

Additional context:

  • Workload: LLM inference (vLLM / similar frameworks)

  • Models tested: Gemma 4 26B, Qwen 3.6 27B

  • Issue: Low GPU power usage (~37W cap) + slow response

Thanks in advance for any suggestions.

cychen@spark-7e3d:~/Downloads$ nvidia-smi
Fri Apr 24 22:50:47 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   54C    P0             35W /  N/A  |          Not Supported |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2718      C   …chen/talktype/venv/bin/python         3744MiB |
|    0   N/A  N/A          225204      G   /usr/lib/xorg/Xorg                      423MiB |
|    0   N/A  N/A          225386      G   /usr/bin/gnome-shell                    256MiB |
|    0   N/A  N/A          226028      G   …exec/xdg-desktop-portal-gnome           67MiB |
|    0   N/A  N/A         1224186      G   /usr/bin/nautilus                        68MiB |
|    0   N/A  N/A         2051877      G   …/.mount_ObsiditW7Vs3/obsidian           63MiB |
|    0   N/A  N/A         2404386      G   …/8188/usr/lib/firefox/firefox           68MiB |
|    0   N/A  N/A         2559727      G   /usr/share/code/code                    180MiB |
|    0   N/A  N/A         3567994      C   VLLM::EngineCore                      73908MiB |
+-----------------------------------------------------------------------------------------+

You are most likely experiencing the PD (power delivery) bug. Unplug the PSU from the Spark and from the wall for 2 minutes, then plug it back in. You should have full power afterwards.

Thanks for your reply. I just tried it, but the situation has not improved much. Responses from Qwen 27B FP8 are so slow that average generation throughput is around 6-7 tokens/sec. It seems faster when running Gemma 4 26B NVFP4. I don’t know the real performance limit of this DGX Spark.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   66C    P0             38W /  N/A  |          Not Supported |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2701      C   …chen/talktype/venv/bin/python         3728MiB |
|    0   N/A  N/A            5815      G   /usr/lib/xorg/Xorg                      212MiB |
|    0   N/A  N/A            6021      G   /usr/bin/gnome-shell                    184MiB |
|    0   N/A  N/A            7002      G   /usr/share/code/code                    183MiB |
|    0   N/A  N/A           53124      C   VLLM::EngineCore                      89368MiB |
|    0   N/A  N/A           63506      G   …/8188/usr/lib/firefox/firefox           10MiB |
+-----------------------------------------------------------------------------------------+

The Spark is limited mainly by its terrible memory bandwidth. You will see higher power consumption (and much higher aggregate output) if you run vLLM in batched mode. You will also see higher power consumption running models like 122B-A10B, or if you use speculative decoding, because then the GPU actually gets to do something, as opposed to just waiting for the memory to shuffle the bits back and forth. A single session of A35B-A3B with no MTP uses 35 W. A single session of 122B-A10B-int4 with MTP-1 on the same machine uses 95 W, and you can hear the difference.
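
To illustrate batched mode, here is a minimal sketch using vLLM’s offline Python API (the model name is a placeholder; substitute whatever you are serving). The engine batches the prompts via continuous batching, so aggregate tokens/sec, and power draw, go well above a single interactive session:

# batched_infer.py - minimal batched offline inference with vLLM.
# The model id below is a placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="your/model-here")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=256)

# 64 prompts submitted at once; vLLM batches them internally, so the GPU
# does useful compute instead of stalling on memory for one decode stream.
prompts = [f"Write a one-paragraph summary of topic {i}." for i in range(64)]
outputs = llm.generate(prompts, params)

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_tokens} tokens across {len(outputs)} prompts")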

You probably won’t see much more than that with the 27B; it’s really bottlenecked by memory bandwidth, not compute, and compute is what would drive your wattage higher.
I have also noticed that in most situations, especially with FP8, the GPU doesn’t need much wattage to reach full clock speed and max out the memory throughput.
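
To put numbers on that: single-stream decode has to read every weight once per token, so bandwidth sets a hard ceiling. A back-of-envelope sketch, assuming the commonly quoted ~273 GB/s memory bandwidth for GB10 (check your unit’s spec) and a 27B dense model at FP8, roughly 1 byte per parameter:

# Decode ceiling for a bandwidth-bound dense model (assumed numbers).
bandwidth_gb_s = 273.0   # assumed GB10 LPDDR5X bandwidth in GB/s
weights_gb = 27.0        # ~27B params at FP8 ~= 27 GB read per decoded token

ceiling = bandwidth_gb_s / weights_gb
print(f"~{ceiling:.0f} tokens/s upper bound per stream")  # ~10 tokens/s

The observed 6-7 tokens/sec sits just under that ~10 tokens/sec ceiling once KV-cache reads and other overheads are counted, so it is consistent with a bandwidth-bound decode rather than a misconfigured GPU.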

lol ^ yup

Thank you.

Thank you. I did see a brief power spike up to 50W, lasting less than 0.2 seconds. But I still need to track down the cause of the very low throughput, which is around 7 tokens/sec.

It also depends on what task your GPU is processing and which models. There are models I run that peg the GPU at 80W, and others that sit at ~40W all the time with small peaks over that mark.

Sparkview — GPU monitor tool with GB10-aware unified memory handling

This is a neat tool to use for monitoring your system :)
