I am experiencing a reproducible hardware-level power issue with my DGX Spark GB10.
Symptoms:
GPU power remains at ~5W under all workloads
GPU utilization stays at 0%
No performance scaling during inference or stress tests
nvidia-smi reports Power Limits as N/A
GPU temperature stabilizes around ~56–57°C, well below the expected range under load (example query below)
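For reference, the readings above can be reproduced with a standard nvidia-smi query along these lines:

```bash
# Snapshot of power draw, power limits, utilization and temperature
nvidia-smi --query-gpu=name,power.draw,power.limit,enforced.power.limit,utilization.gpu,temperature.gpu --format=csv
```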
Firmware status:
All NVIDIA high-priority firmware updates have been applied successfully:
Embedded Controller firmware update
SoC / UEFI / GPU firmware update
Kernel logs:
Before the firmware updates, dmesg consistently reported:
“Detected insufficient power on the PCIe slot (27W)”
After applying the firmware updates and performing multiple cold power cycles, the error messages no longer appear, but the behavior remains unchanged.
This strongly suggests a persistent hardware-level power negotiation or power delivery issue (EC / PD / SoC related).
System details:
Product: DGX Spark Founders Edition Mini (GB10)
OS: Linux
Serial Number: 1984025003128
I have full diagnostic logs (fwupdmgr, nvidia-smi, dmesg, journalctl) and a technical report available if needed.
Could you please confirm whether this behavior is expected or if this indicates a hardware defect requiring replacement (RMA)?
Note: I am aware of the related discussions regarding DGX Spark power reporting and GPU power capping around ~80–100W, where nvidia-smi only reflects GPU power and not total system power.
However, this case is fundamentally different:
The GPU in this system does not scale at all under load and remains at ~5W with 0% utilization, regardless of workload. This is not a question of power interpretation or expected GPU power limits, but a persistent lack of power and compute engagement, consistent with a hardware-level safety or power negotiation issue.
Please run a workload and, in parallel, collect the diagnostic logs (fwupdmgr, nvidia-smi, dmesg, journalctl) plus an NVIDIA bug report. Then DM me the results (for privacy). Also share the exact workload / containers / repro steps used.
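Something along these lines works for the parallel collection (the workload command below is just a placeholder; substitute your actual repro steps):

```bash
# Log GPU power / utilization / clocks to a file while the workload runs
nvidia-smi dmon -s puc -f dmon_during_workload.log &
DMON_PID=$!

./your_gpu_workload.sh                     # placeholder: your actual command or container

kill "$DMON_PID"

# Collect the remaining diagnostics after the run
fwupdmgr get-devices        > fwupdmgr-devices.log
fwupdmgr get-history        > fwupdmgr-history.log
nvidia-smi -q -d POWER      > nvidia-smi-power.log
sudo dmesg                  > dmesg.log
journalctl -k --no-pager    > journalctl-k.log
sudo nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz in the current directory
```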
Your nvidia-smi log is timestamped Jan 4, 2025 (not 2026). Not a blocker, but worth correcting to avoid confusion.
If you want, paste the actual dmesg / journalctl -k lines from the run where you saw the 27W PCIe slot message (those lines are not present in the files you zipped).
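For example (the pattern just matches the message text you quoted; adjust it if the wording differs on your build):

```bash
sudo dmesg --ctime | grep -i "insufficient power"
journalctl -k -b -1 --no-pager | grep -i "insufficient power"   # previous boot, in case the message only appeared before the firmware updates
```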
You mentioned that a “CUDA/Torch workload cannot currently be run”. We’ll need to run a workload to confirm: an idle system sitting at ~5 W and 0% utilization would be expected.
GPU temperature stabilizes around ~56–57°C, far below expected operating range
Kernel / platform evidence:
Kernel logs indicate a platform-enforced PCIe slot power limit (27W)
Messages confirm GPU is operational but power-limited
No new errors appear after firmware updates, but behavior remains unchanged
Diagnostics collected in parallel:
fwupdmgr (get-devices, get-updates, history)
nvidia-smi (standard, -q, -q -d POWER, dmon during workload)
dmesg and journalctl (full and filtered)
nvidia-bug-report.sh
Firmware status:
Embedded Controller firmware updated to version 0x02004b03
SoC / UEFI / GPU firmware updated to version 0x02009009
All NVIDIA high-priority firmware updates applied successfully
Conclusion from observed results:
Despite a valid GPU-offloaded workload running to completion, the GPU never exits its idle power state. Power and utilization do not scale under load, which is inconsistent with expected GB10 behavior and points to a hardware-level power delivery / platform issue.
If you would like me to run a specific NVIDIA-recommended validation test (CUDA sample, NGC container, or exact command), please specify the exact workload and I will execute it verbatim and provide the resulting logs.
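For example, as a generic load generator (my own suggestion, not an NVIDIA-recommended test), I could build and run gpu-burn from the public repository while logging nvidia-smi dmon in parallel:

```bash
# Community GPU stress tool; not an official NVIDIA validation workload
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make            # adjust the CUDA path in the Makefile if nvcc is not in the default location
./gpu_burn 120  # 120-second stress run
```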
It looks like you’ve installed the wrong NVIDIA drivers for the DGX Spark platform.
The NVIDIA driver is preinstalled on DGX Spark; you should not have to install anything yourself
a. If you have manually installed a driver, you may need to reflash your Spark: System Recovery — DGX Spark User Guide
You should not be running older workloads on a CUDA 13-supported device
Any errors you see are probably from mismatched driver and workload versions
I don’t know how you got into this state, but some users have reported it after installing the GPU Operator and device plugin containers in Kubernetes pods. The GPU Operator does support DGX Spark. Please follow the GPU Operator User Guide on how to deploy it, specifically the section for preinstalled NVIDIA drivers and container toolkit:
Thanks Raphael for the response, but as I mentioned, this does not match my scenario at all, and marking the thread as “Solved” is premature and incorrect.
To recap clearly:
No manual driver installation, no GPU Operator, no Kubernetes, no device plugins – pure stock DGX Spark OS with preinstalled driver 580.95.05 (CUDA 13.0).
All firmware applied via fwupdmgr.
GPU is fully detected and loaded, nvidia-smi works without communication errors.
Issue: Persistent power draw stuck at ~5W, 0% utilization under heavy workloads, Power Limits reported as N/A – even after reboots and 10+ hour cold power-off.
This is not a driver mismatch (which would cause init failures or NVML errors). It points to a platform-level power capping issue, similar to other reported cases (e.g., Safety Mode throttling, PD negotiation failures, or early Blackwell power delivery bugs).
Please reopen / unmark as “Solved” and advise next steps:
Recommended NVIDIA validation tools/workloads for power diagnostics?
Escalate to engineering for potential firmware/hotfix or hardware evaluation (RMA)?
On my side, the thread was displayed with a “Solved” label after the earlier reply, which is why I mentioned it. If that was a UI or caching artifact and the topic is not marked as solved internally, that’s perfectly fine.
Based on the current evidence, we can rule out a driver mismatch: the system is running the stock DGX Spark OS with the preinstalled NVIDIA driver, the GPU is correctly detected and initialized, and there are no NVML or initialization errors. The issue is limited to persistent power and utilization capping.
To move this forward, could you please advise on the preferred next step from NVIDIA’s side?
From my perspective, there are two clear options:
You recommend a specific NVIDIA validation workload (exact CUDA sample, command, or NGC container), which I will run verbatim and provide the corresponding logs.
The issue is escalated for platform-level evaluation (power delivery / EC / board), given the persistent ~5W GPU power under active workloads despite updated firmware and extended cold power-off.
Please let me know which path you recommend so I can proceed accordingly.
I have observed this behavior since first power-on / initial setup of the system.
The GPU has never exited the ~5W power state or shown non-zero utilization under load at any point. There were no configuration changes, no manual driver installations, and no workload changes associated with the onset of this behavior.
I initially assumed this was a known early-platform issue that would be resolved via firmware updates. For this reason, I waited for and applied all NVIDIA-provided high-priority firmware updates via fwupdmgr, specifically:
Embedded Controller firmware: version 0x02004b03 (release date: 2025-10-24)
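The applied versions can be confirmed directly from fwupdmgr:

```bash
fwupdmgr get-devices    # current firmware version reported per device
fwupdmgr get-history    # record of applied updates and their results
```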
I DM’d you the full validation bundle (tar.gz) + NVIDIA bug report output for privacy.
If you need a specific NVIDIA-approved workload (exact command/container) to force power scaling, tell me which one and I’ll run it verbatim and report back.
The issue was not hardware or firmware related.
Root cause was an outdated driver / CUDA stack:
Previously: Driver 550.54.15 + CUDA 12.4 → GPU stuck at ~5W, 0% util under load
After running gpu-burn (which pulled a newer stack) and upgrading to Driver 580.95.05 + CUDA 13.0, the GB10 now initializes and scales power correctly under compute load
So this was a driver mismatch on Blackwell/Grace (GB10, ARM64).
Firmware (EC/SoC) was already up to date and not the problem.
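For anyone hitting the same symptom, the installed stack can be checked with standard commands before and after the upgrade (nothing DGX Spark-specific here):

```bash
nvidia-smi --query-gpu=driver_version,name --format=csv   # driver version as seen by NVML
cat /proc/driver/nvidia/version                           # kernel module build string
nvcc --version                                            # CUDA toolkit version, if the toolkit is installed
```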
Special thanks to elsaco for pushing in the right direction – much appreciated.