DGX Spark GB10 GPU is stuck at ~5W power and 0% utilization even after all NVIDIA firmware updates

Hello NVIDIA team,

I am experiencing a reproducible hardware-level power issue with my DGX Spark GB10.

Symptoms:

  • GPU power remains at ~5W under all workloads
  • GPU utilization stays at 0%
  • No performance scaling during inference or stress tests
  • nvidia-smi reports Power Limits as N/A
  • Thermal limit observed around ~56–57°C (well below expected operating range)

Firmware status:
All NVIDIA high-priority firmware updates have been applied successfully:

  • Embedded Controller firmware update
  • SoC / UEFI / GPU firmware update

Kernel logs:
Before the firmware updates, dmesg consistently reported:
“Detected insufficient power on the PCIe slot (27W)”

After applying the firmware updates and performing multiple cold power cycles, the error messages no longer appear, but the behavior remains unchanged.

This strongly suggests a persistent hardware-level power negotiation or power delivery issue (EC / PD / SoC related).

System details:

  • Product: DGX Spark Founders Edition Mini (GB10)
  • OS: Linux
  • Serial Number: 1984025003128

I have full diagnostic logs (fwupdmgr, nvidia-smi, dmesg, journalctl) and a technical report available if needed.

Could you please confirm whether this behavior is expected or if this indicates a hardware defect requiring replacement (RMA)?

Thank you for your support.


Note: I am aware of the related discussions regarding DGX Spark power reporting and GPU power capping around ~80–100W, where nvidia-smi only reflects GPU power and not total system power.

However, this case is fundamentally different:
The GPU in this system does not scale at all under load and remains at ~5W with 0% utilization, regardless of workload. This is not a question of power interpretation or expected GPU power limits, but a persistent lack of power and compute engagement, consistent with a hardware-level safety or power negotiation issue.

Please run a workload and, in parallel, collect the diagnostic logs (fwupdmgr, nvidia-smi, dmesg, journalctl) plus an NVIDIA bug report. Then DM me the results (for privacy). Also share the exact workload / containers / repro steps you used.

thanks for DM’ing me the log bundle.

Your nvidia-smi log is timestamped Jan 4, 2025 (not 2026). Not a blocker, but worth correcting to avoid confusion.

If you want, paste the actual dmesg / journalctl -k lines from the run where you saw the 27W PCIe slot message (those lines are not present in the files you zipped).

You mentioned that a “CUDA/Torch workload cannot currently be run”. We’ll need an active workload to confirm: ~5W and 0% utilization is expected on an idle system.

Thanks for the review. Below are the executed tests and their concrete results.

Executed workload:

  • Ollama inference using model gemma2:2b
  • GPU offload explicitly enabled
  • Bare-metal execution (no containers, no virtualization)
  • Inference completes successfully (generation finishes normally)
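For reproducibility, here is a minimal sketch of that measurement setup: run the same Ollama inference while sampling GPU power and utilization once per second with nvidia-smi dmon. The prompt string and file names are my own illustrative choices (only the model name comes from the report above), and the tools are guarded so the script records a skip instead of failing on machines without them.

```shell
#!/usr/bin/env bash
# Hedged sketch, not an official NVIDIA procedure.
if command -v nvidia-smi >/dev/null 2>&1 && command -v ollama >/dev/null 2>&1; then
  # Sample power (p) and utilization (u) at 1-second intervals in the background.
  nvidia-smi dmon -s pu -d 1 > dmon_during_ollama.log &
  MON=$!
  # Run the same model used in the report; the prompt is illustrative.
  ollama run gemma2:2b "Write one sentence about GPUs."
  kill "$MON" 2>/dev/null
  STATUS="samples written to dmon_during_ollama.log"
else
  STATUS="nvidia-smi and/or ollama not available; skipped"
fi
echo "$STATUS" | tee ollama_probe_status.txt
```

On a healthy GB10 the dmon log should show power and utilization rising well above the idle ~5W / 0% while the generation runs.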

Observed results during active workload:

  • GPU power remains fixed at ~5W throughout the run
  • GPU utilization remains at 0%
  • No increase in clocks or power draw observed
  • nvidia-smi reports Power Limits: N/A
  • GPU temperature stabilizes around ~56–57°C, far below expected operating range

Kernel / platform evidence:

  • Kernel logs indicate a platform-enforced PCIe slot power limit (27W)
  • Messages confirm GPU is operational but power-limited
  • No new errors appear after firmware updates, but behavior remains unchanged
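To pin down the exact kernel lines being referenced, a small sketch for isolating the slot-power messages. The sample line below mirrors the message quoted earlier in this thread (later in the thread it is identified as coming from the mlx5_core NIC driver); on a live system you would run the commented `dmesg` pipeline instead.

```shell
#!/usr/bin/env bash
# Hedged sketch: filter kernel logs for the PCIe slot-power message.
# On a live system:  sudo dmesg | grep -iE 'insufficient power|pcie slot'
# The sample line is reconstructed from the message quoted in this thread.
sample='[   12.345678] mlx5_core 0000:01:00.0: Detected insufficient power on the PCIe slot (27W)'
printf '%s\n' "$sample" | grep -iE 'insufficient power|pcie slot' > slot_power_lines.txt
cat slot_power_lines.txt
```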

Diagnostics collected in parallel:

  • fwupdmgr (get-devices, get-updates, history)
  • nvidia-smi (standard, -q, -q -d POWER, dmon during workload)
  • dmesg and journalctl (full and filtered)
  • nvidia-bug-report.sh
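The collection steps above can be sketched as one script that gathers everything into a timestamped directory. The file and directory names are my own choices, not an NVIDIA convention, and each tool is guarded so the script degrades gracefully where a tool is missing.

```shell
#!/usr/bin/env bash
# Hedged sketch: collect the diagnostic bundle described above.
set -u
OUT="spark_diag_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUT"

capture() {                      # capture <label> <command...>
  local label="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" > "$OUT/$label.txt" 2>&1 || true
  else
    echo "$1 not found" >> "$OUT/skipped.txt"
  fi
}

capture fwupd_devices fwupdmgr get-devices
capture fwupd_history fwupdmgr history
capture smi_query     nvidia-smi -q
capture smi_power     nvidia-smi -q -d POWER
capture kernel_log    dmesg
capture journal_kern  journalctl -k --no-pager
echo "bundle written to $OUT"
```

The resulting directory can then be archived (e.g. `tar czf "$OUT.tar.gz" "$OUT"`) and shared via DM as requested above.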

Firmware status:

  • Embedded Controller firmware updated to version 0x02004b03
  • SoC / UEFI / GPU firmware updated to version 0x02009009
  • All NVIDIA high-priority firmware updates applied successfully

Conclusion from observed results:
Despite a valid GPU-offloaded workload running to completion, the GPU never exits its idle power state. Power and utilization do not scale under load, which is inconsistent with expected GB10 behavior and points to a hardware-level power delivery / platform issue.

If you would like me to run a specific NVIDIA-recommended validation test (CUDA sample, NGC container, or exact command), please specify the exact workload and I will execute it verbatim and provide the resulting logs.

spark_diagnostics_ollama_complete_20260104b.zip (5.3 KB)

It looks like you’ve installed the wrong NVIDIA drivers for the DGX Spark platform.

  1. NVIDIA driver should already be installed and you do not have to do anything yourself
    a. If you have manually installed a driver you may need to reflash your Spark: System Recovery — DGX Spark User Guide
  2. You should not run workloads built for older CUDA versions on a CUDA 13 supported device
  3. Any errors you see are probably from the mismatched driver and workload versions

I don’t know how you got there, but some users have reported this after installing the GPU Operator and device plugin containers running in Kubernetes pods. GPU Operator does support DGX Spark. Please follow the GPU Operator User Guide on how to deploy it, specifically the section for preinstalled NVIDIA drivers and container toolkit.

Thanks Raphael for the response, but as I mentioned, this does not match my scenario at all, and marking the thread as “Solved” is premature and incorrect.

To recap clearly:

  • No manual driver installation, no GPU Operator, no Kubernetes, no device plugins – pure stock DGX Spark OS with preinstalled driver 580.95.05 (CUDA 13.0).
  • All firmware applied via fwupdmgr.
  • GPU is fully detected and loaded, nvidia-smi works without communication errors.
  • Issue: Persistent power draw stuck at ~5W, 0% utilization under heavy workloads, Power Limits reported as N/A – even after reboots and 10+ hour cold power-off.

This is not a driver mismatch (which would cause init failures or NVML errors). It points to a platform-level power capping issue, similar to other reported cases (e.g., Safety Mode throttling, PD negotiation failures, or early Blackwell power delivery bugs).

Please reopen / unmark as “Solved” and advise next steps:

  • Recommended NVIDIA validation tools/workloads for power diagnostics?
  • Escalate to engineering for potential firmware/hotfix or hardware evaluation (RMA)?

Appreciate your help in resolving this properly.

The topic is not solved. Where are you seeing this?

@raphael.amorim Thanks for the clarification.

On my side, the thread was displayed with a “Solved” label after the earlier reply, which is why I mentioned it. If that was a UI or caching artifact and the topic is not marked as solved internally, that’s perfectly fine.

Based on the current evidence, we can rule out a driver mismatch: the system is running the stock DGX Spark OS with the preinstalled NVIDIA driver, the GPU is correctly detected and initialized, and there are no NVML or initialization errors. The issue is limited to persistent power and utilization capping.

To move this forward, could you please advise on the preferred next step from NVIDIA’s side?

From my perspective, there are two clear options:

  1. You recommend a specific NVIDIA validation workload (exact CUDA sample, command, or NGC container), which I will run verbatim and provide the corresponding logs.
  2. The issue is escalated for platform-level evaluation (power delivery / EC / board), given the persistent ~5W GPU power under active workloads despite updated firmware and extended cold power-off.

Please let me know which path you recommend so I can proceed accordingly.

Hi LQWTECH,

Question: how long have you observed this behavior?

Hi NVES,

I have observed this behavior since first power-on / initial setup of the system.

The GPU has never exited the ~5W power state or shown non-zero utilization under load at any point. There were no configuration changes, no manual driver installations, and no workload changes associated with the onset of this behavior.

I initially assumed this was a known early-platform issue that would be resolved via firmware updates. For this reason, I waited for and applied all NVIDIA-provided high-priority firmware updates via fwupdmgr, specifically:

  • Embedded Controller firmware
    version 0x02004b03 (release date: 2025-10-24)

  • SoC / UEFI / GPU firmware
    version 0x02009009 (release date: 2025-10-24)

In my case, the behavior remains unchanged after these updates and after extended cold power-off testing (10+ hours fully powered off).

To be clear: there has never been a point where the GPU operated at normal power levels or normal performance.

Please let me know how you would like to proceed next.

If you run:

sudo apt update
sudo apt dist-upgrade
sudo fwupdmgr refresh
sudo fwupdmgr upgrade

and copy/paste the output here in this thread. Then run sudo reboot and, after the reboot, run:

nvidia-smi
docker run --runtime=nvidia --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
nvcc --version
dmesg | grep -i "firmware"
fwupdmgr get-devices
fwupdmgr history

And also copy them here.
Thanks

@raphael.amorim Acknowledged — I ran your exact command set on the stock DGX Spark OS.

High-level results:

  • nvidia-smi still shows GB10 stuck at ~5W and 0% util (Power Cap / Power Limits: N/A).
  • CUDA container nvidia-smi also reports the same (host + container consistent).
  • fwupdmgr refresh/upgrade: no further updates available.
  • fwupdmgr get-devices/history confirm the applied EC + SoC/UEFI/GPU firmware (Oct 24, 2025 releases).
  • dmesg | grep -i firmware output captured.

I DM’d you the full validation bundle (tar.gz) + NVIDIA bug report output for privacy.
If you need a specific NVIDIA-approved workload (exact command/container) to force power scaling, tell me which one and I’ll run it verbatim and report back.

@LQWTECH the insufficient power kernel message is from the mlx5_core driver. Just ignore it! The issue is also being discussed in another thread.

Could you post the output of nvidia-smi? This is how it looks on an (idle) Spark with latest firmware and drivers update:

Mon Jan  5 16:20:56 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   34C    P8              4W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

and with gpu-burn running:

Mon Jan  5 16:57:39 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   48C    P0             28W /  N/A  | Not Supported          |     96%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3359      C   ./gpu_burn                            10514... |
+-----------------------------------------------------------------------------------------+
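For anyone who wants to reproduce the load shown above, a hedged sketch of one common way to build and run gpu-burn, assuming the widely used wilicc/gpu-burn source tree (it is a community tool, not an NVIDIA-official one). The guards make the script record a skip instead of failing on machines without a CUDA toolchain.

```shell
#!/usr/bin/env bash
# Hedged sketch: build and run gpu-burn for a 60-second stress load.
if command -v git >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1 \
   && command -v nvidia-smi >/dev/null 2>&1; then
  git clone https://github.com/wilicc/gpu-burn.git
  ( cd gpu-burn && make && ./gpu_burn 60 )   # 60-second stress run
  echo "gpu-burn run attempted" > gpu_burn_probe.txt
else
  echo "CUDA toolchain not available; skipped" > gpu_burn_probe.txt
fi
cat gpu_burn_probe.txt
```

While it runs, watching `nvidia-smi` in a second terminal should show the jump to P0, high utilization, and elevated power draw as in the output above.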

Update / Resolved – Thanks elsaco

Hi all, quick update to close this thread.

The issue was not hardware or firmware related.
Root cause was an outdated driver / CUDA stack:

  • Previously: Driver 550.54.15 + CUDA 12.4 → GPU stuck at ~5W, 0% util under load

  • After running gpu-burn (which pulled a newer stack) and upgrading to Driver 580.95.05 + CUDA 13.0, the GB10 now initializes and scales power correctly under compute load

So this was a driver mismatch on Blackwell/Grace (GB10, ARM64).
Firmware (EC/SoC) was already up to date and not the problem.
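For anyone hitting the same symptom, a quick sketch for checking the installed driver and CUDA toolkit versions in one place. It only reads version strings; the expected values (driver 580.95.05 / CUDA 13.0 on DGX Spark) come from this thread, not from me.

```shell
#!/usr/bin/env bash
# Hedged sketch: driver / CUDA toolkit consistency check.
{
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)"
  fi
  if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | grep -i release            # toolkit release line
  fi
  [ -r /proc/driver/nvidia/version ] && cat /proc/driver/nvidia/version
  echo "version check complete"
} > version_check.txt 2>&1
cat version_check.txt
```

A driver from an older branch paired with a CUDA 13 device (as in this case: 550.54.15 + CUDA 12.4) is exactly the mismatch this check is meant to surface.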

Special thanks to elsaco for pushing in the right direction – much appreciated.

Issue resolved.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.