Shutdown Issue: Is This a Hardware Problem or Software Problem?

Hello everyone,

Shutdown reports were almost non-existent last year, but recently I’ve been seeing them more frequently.

What’s particularly concerning is that this problem doesn’t occur early on, but rather affects users who have been using the device for some time.


I experienced this issue myself, and I’ve already sent my unit to Nvidia for RMA through the Elite Partner company where I purchased it. I provided field diagnostic logs to an Nvidia moderator via DM and was advised to proceed with the RMA, which is why I took this action.


The shutdowns, which have been increasing rapidly, appear to occur mainly during high-load AI workloads, such as:

  • @eugr’s llama-benchy
  • Video generation using the Wan2.2 default template in ComfyUI

It seems that shutdowns occur not only during test workflows, but also during sustained high-load AI work in actual production scenarios.


The only temporary workaround I know of is to lock the GPU clock range so that the maximum is approximately 2300 MHz or below, using a command like:

sudo nvidia-smi -lgc min,max

Gradually lower the maximum value until shutdowns stop recurring.
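
The stepping-down procedure can be sketched as a few shell helpers. This is only a rough sketch, not Nvidia guidance: the function names (next_cap, lock_clocks, reset_clocks), the step size, and the floor are assumptions made for illustration. The only real nvidia-smi options used are -lgc (lock GPU clocks) and -rgc (reset GPU clocks).

```shell
# Hypothetical helpers for stepping the clock cap down. All values are in MHz.

# Print the next lower cap, never going below the floor.
# $1 = current cap, $2 = step size, $3 = floor
next_cap() {
  next=$(( $1 - $2 ))
  if [ "$next" -lt "$3" ]; then next=$3; fi
  echo "$next"
}

# Lock the GPU clock range to min,max. Requires root.
# $1 = min, $2 = max
lock_clocks() {
  sudo nvidia-smi -lgc "$1,$2"
}

# Remove the lock and return to default clock behavior.
reset_clocks() {
  sudo nvidia-smi -rgc
}
```

For example, start with lock_clocks MIN 2300 (where MIN is whatever minimum you choose), rerun the workload, and if it still shuts down, use "$(next_cap 2300 100 1500)" as the new maximum. The step of 100 and floor of 1500 are placeholders, not tested values.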


However, “sudo nvidia-smi -lgc min,max” is NOT a solution.

If adequate thermal management and stable power delivery are guaranteed, the device should operate normally without any clock limit.

My other normal Spark units, as long as adequate thermal management is guaranteed, consistently draw over 100 W of power and maintain high clocks of 2400 to 2600 MHz.

Most importantly, a normal Spark should automatically regulate power and clock speeds appropriately even without ideal thermal management conditions, preventing shutdowns from occurring.

However, units exhibiting the shutdown symptom appear unable to sustain high-load AI work that requires consistently high clocks and power levels, even with the same firmware, the same software, and adequate thermal management.


Therefore, I’m posting this so that Nvidia’s DGX Spark staff can investigate this issue more closely and help resolve it.

Although I’ve already sent my unit for RMA, I cannot be certain that this problem won’t occur in other normal units I currently have, so I’m paying closer attention to this issue.


If anyone else has experienced related problems, I think noting at what stage the shutdown occurred would help us track down the issue more easily.
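
One way to capture the stage at which a shutdown happens is to log GPU telemetry to disk once per second and read the last sample after the reboot. This is a minimal sketch under assumptions: the function names and the default log path gpu_log.csv are hypothetical, and some query fields may read N/A on a given platform. The field names used are the ones listed by nvidia-smi --help-query-gpu.

```shell
# Append one telemetry sample per second to a CSV log (no header row).
# $1 = log file path (hypothetical default: gpu_log.csv)
log_gpu() {
  log=${1:-gpu_log.csv}
  while true; do
    nvidia-smi --query-gpu=timestamp,clocks.sm,power.draw,temperature.gpu \
               --format=csv,noheader >> "$log"
    sync      # flush to disk so the last sample survives an abrupt power-off
    sleep 1
  done
}

# After the reboot, print the last sample written before the shutdown.
# $1 = log file path
last_sample() {
  tail -n 1 "$1"
}
```

Run log_gpu in a second terminal before starting the workload, then call last_sample on the same file once the machine is back up. The final clock, power, and temperature readings make shutdown reports easier to compare across units.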

I’ve left the information and the logs with the mods here. I have the exact same reproducible problem. I have no confidence they will confirm a defect or return a unit to me that doesn’t exhibit it.

Hi Josephbreda, if you’ve initiated an RMA, please DM me your case reference so I can route it to engineering for review.

I sent logs but have not received further information. I’m a bit skeptical about sending the unit away for an undetermined amount of time. Right now it runs continuously if I reduce the clocks. I’d really like to see some acknowledgement of a broader problem before I send it back.