GB10 GPU Power Stuck Around 37W When Running LLMs (Gemma 4 26B / Qwen 3.6 27B)

Hi,

I’ve been using a DGX Spark system for about a month. I’m still relatively new to this area, but I’ve been learning through hands-on experience.

I’m currently encountering a performance issue that I don’t fully understand.

When running LLMs, especially larger models like Gemma 4 26B and Qwen 3.6 27B, I consistently observe that the GB10 GPU power consumption stays at ~37W and does not scale higher under load.

Because of this, the inference speed is significantly slower than expected.

What I find confusing:

  • The GPU appears to be active, but power usage is very low

  • It never ramps up beyond ~37W even during sustained inference

  • This happens consistently across different models

I’m trying to determine whether:

  1. This is expected behavior for GB10 (power-limited by design?), or

  2. There is a configuration / software issue (e.g., CUDA, driver, vLLM, PyTorch) causing the GPU to not fully utilize its capacity (see the monitoring sketch below this list)
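
For reference, here is a minimal polling sketch for watching power, clocks, and utilization while inference runs. The use of the nvidia-ml-py package (pip install nvidia-ml-py) is just my choice of tool, not anything DGX Spark-specific, and I avoid the memory-usage queries that nvidia-smi already reports as Not Supported on GB10:

# poll_power.py - sample GPU power, SM clock, and utilization once per second.
# Assumes the nvidia-ml-py package; memory-usage queries are skipped because
# nvidia-smi reports them as Not Supported on GB10.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"power={power_w:5.1f}W  sm_clock={sm_clock}MHz  "
              f"gpu_util={util.gpu}%  mem_util={util.memory}%")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()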

If anyone has experience with DGX Spark or GB10, I would really appreciate your insights.

Additional context:

  • Workload: LLM inference (vLLM / similar frameworks)

  • Models tested: Gemma 4 26B, Qwen 3.6 27B

  • Issue: Low GPU power usage (~37W cap) + slow response

Thanks in advance for any suggestions.

cychen@spark-7e3d:~/Downloads$ nvidia-smi
Fri Apr 24 22:50:47 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   54C    P0             35W /  N/A  |          Not Supported |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2718      C   …chen/talktype/venv/bin/python         3744MiB |
|    0   N/A  N/A          225204      G   /usr/lib/xorg/Xorg                      423MiB |
|    0   N/A  N/A          225386      G   /usr/bin/gnome-shell                    256MiB |
|    0   N/A  N/A          226028      G   …exec/xdg-desktop-portal-gnome           67MiB |
|    0   N/A  N/A         1224186      G   /usr/bin/nautilus                        68MiB |
|    0   N/A  N/A         2051877      G   …/.mount_ObsiditW7Vs3/obsidian           63MiB |
|    0   N/A  N/A         2404386      G   …/8188/usr/lib/firefox/firefox           68MiB |
|    0   N/A  N/A         2559727      G   /usr/share/code/code                    180MiB |
|    0   N/A  N/A         3567994      C   VLLM::EngineCore                      73908MiB |
+-----------------------------------------------------------------------------------------+

You are most likely experiencing the PD (power delivery) bug. Unplug the PSU from the Spark and from the wall for 2 minutes, then plug it back in. You should have full power afterwards.

Thanks for your reply. I just tried it, but the situation has not improved much. Responses from Qwen 27B FP8 are so slow that average generation throughput is around 6-7 tokens/sec. It seems faster when running Gemma 4 26B NVFP4. I don’t know the real performance limit of this DGX Spark.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   66C    P0             38W /  N/A  |          Not Supported |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2701      C   …chen/talktype/venv/bin/python         3728MiB |
|    0   N/A  N/A            5815      G   /usr/lib/xorg/Xorg                      212MiB |
|    0   N/A  N/A            6021      G   /usr/bin/gnome-shell                    184MiB |
|    0   N/A  N/A            7002      G   /usr/share/code/code                    183MiB |
|    0   N/A  N/A           53124      C   VLLM::EngineCore                      89368MiB |
|    0   N/A  N/A           63506      G   …/8188/usr/lib/firefox/firefox           10MiB |
+-----------------------------------------------------------------------------------------+

The Spark is limited mainly by its terrible memory bandwidth. You will see higher power consumption (and much higher aggregate output) if you run vLLM in batched mode. You will also see higher power consumption running models like 122B-A10B, or if you use speculative decoding, because then the GPU actually gets to do something, as opposed to just waiting for the memory to shuffle the bits back and forth. A single session of A35B-A3B with no MTP uses 35 W. A single session of 122B-A10B-int4 with MTP-1 on the same machine uses 95 W, and you can hear the difference.
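
To illustrate batched mode, here is a minimal sketch using vLLM’s offline Python API (the model name is a placeholder; substitute whatever you are serving). The engine batches the prompts via continuous batching, so aggregate tokens/sec, and power draw, go well above a single interactive session:

# batched_infer.py - minimal batched offline inference with vLLM.
# The model id below is a placeholder, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="your/model-here")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=256)

# 64 prompts submitted at once; vLLM batches them internally, so the GPU
# does useful compute instead of stalling on memory for one decode stream.
prompts = [f"Write a one-paragraph summary of topic {i}." for i in range(64)]
outputs = llm.generate(prompts, params)

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_tokens} tokens across {len(outputs)} prompts")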

You probably won’t see much more than that with the 27B; it’s really bottlenecked by memory bandwidth, not compute, and compute is what would drive your wattage higher.
I have also noticed that in most situations, especially with FP8, the GPU doesn’t need much wattage to reach full clock speed and max out the memory throughput.
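
To put numbers on that: single-stream decode has to read every weight once per token, so bandwidth sets a hard ceiling. A back-of-envelope sketch, assuming the commonly quoted ~273 GB/s memory bandwidth for GB10 (check your unit’s spec) and a 27B dense model at FP8, roughly 1 byte per parameter:

# Decode ceiling for a bandwidth-bound dense model (assumed numbers).
bandwidth_gb_s = 273.0   # assumed GB10 LPDDR5X bandwidth in GB/s
weights_gb = 27.0        # ~27B params at FP8 ~= 27 GB read per decoded token

ceiling = bandwidth_gb_s / weights_gb
print(f"~{ceiling:.0f} tokens/s upper bound per stream")  # ~10 tokens/s

The observed 6-7 tokens/sec sits just under that ~10 tokens/sec ceiling once KV-cache reads and other overheads are counted, so it is consistent with a bandwidth-bound decode rather than a misconfigured GPU.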

lol ^ yup

Thank you.

Thank you. I did see a brief power spike up to 50W, lasting less than 0.2 seconds. But I still need to track down the cause of the very low throughput, which is around 7 tokens/sec.

It also depends on what task your GPU is processing and which models. There are models I run that peg the GPU at 80W, and others that sit at ~40W all the time with small peaks over that mark.

Sparkview — GPU monitor tool with GB10-aware unified memory handling

This is a neat tool to use for monitoring your system :)
