DGX Spark shutting down under load - MODS-020000600139

smolllms · June 6, 2026, 3:58am

My DGX Spark has been completely shutting down under sustained inference tasks (spark-vllm settings for qwen3.6-27b-fp8 with MTP). I’ve previously trained ColBERT models and smaller language models without any issues.

After several failures, I followed the field diagnostic instructions, and my system failed the Power Stress test with the code MODS-020000600139.

Notably, the Spark did not shut down during the test; it just failed the test.

Does anyone have experience with this? I believe I’ve already created a ticket in the right place. Is this just an RMA situation, or are there things I can try locally, e.g. with firmware? Not optimistic.

susni · June 6, 2026, 4:42am

i had a similar issue where any NVFP4 models under load would randomly hang my cluster. heatsinks and a fan fixed the issue. not sure if it’s actually a thermal issue because i’ve seen temps go higher and not crash. weird stuff man.

alper.tor · June 6, 2026, 6:31am

I am having a similar issue. It is about the prefill operations. Capping at 2200MHz solved my problems with 3-4% reduction in completion times.
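The post above does not say how the 2200 MHz cap was applied. One possible way, sketched below, is the `--lock-gpu-clocks` option of `nvidia-smi`. Whether that option is honored on the GB10 in a DGX Spark is an assumption that has not been verified here, and the 300 MHz lower bound, the helper names, and the use of `sudo` are hypothetical. The helpers only build the command lines; they do not run anything.

```python
import shlex


def lock_gpu_clocks_cmd(max_mhz: int, min_mhz: int, gpu_index: int = 0) -> list[str]:
    """Build (but do not run) an nvidia-smi command that locks the GPU clock range.

    min_mhz is a placeholder lower bound; pick a clock your GPU actually supports.
    """
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return [
        "sudo", "nvidia-smi", "-i", str(gpu_index),
        f"--lock-gpu-clocks={min_mhz},{max_mhz}",
    ]


def reset_gpu_clocks_cmd(gpu_index: int = 0) -> list[str]:
    """Build the command that removes the lock again."""
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    print(shlex.join(lock_gpu_clocks_cmd(max_mhz=2200, min_mhz=300)))
    print(shlex.join(reset_gpu_clocks_cmd()))
```

Printing the command instead of executing it keeps the sketch safe to run on any machine; a clock lock set this way normally does not persist across reboots, so it would have to be reapplied after each boot.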

Teason2026 · June 7, 2026, 5:45am

I have the same problem (I have not run the field diagnostic because the units are in a rack and it is hard to disable Secure Boot without a KVM). Asus GX10, two units working as a pair over QSFP. They shut down without logs under the first-ish load and the power LED goes off. Funny thing: when I switch a unit back on by pressing the power button, it just works afterwards. Overnight I have automation that gracefully shuts down the server and cuts power at the wall, then brings it back in the morning by re-enabling power to the brick.

It was fine until the last firmware update in April. Now, every single time wall power is re-enabled, it shuts down under load, and it is completely fine on the next boot.

smolllms · June 7, 2026, 7:22pm

Officially I should be going through RMA, right? I guess I’m wondering whether it’s worth doing that, given that other folks report similar issues.

corbett_korbett · June 7, 2026, 8:38pm

Before doing an RMA, I strongly suggest monitoring the temperature of your GPU; the Spark is known to overheat under load in its default setup. You can prevent the overheating by cooling the Spark with an external fan. I 3D-printed an enclosure for my dual Sparks that mounts them upright and has a USB 120mm fan blowing down on them. It keeps the temps in the low to mid 70s °C, with momentary spikes to the low 80s. Before adding the fan, mine would be at 83-85 °C about 5 minutes into an intensive workload. Adding the fan fixed the issue for me, and my Sparks run smoothly.
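A minimal sketch of the kind of temperature monitoring suggested above, assuming `nvidia-smi` is on the PATH and that the `temperature.gpu` and `power.draw` query fields are reported on this platform. The 83 °C warning threshold is taken from the 83-85 °C range mentioned in the post and is not an official limit; the function names are hypothetical.

```python
import subprocess
import time
from typing import Optional, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> Tuple[Optional[float], Optional[float]]:
    """Parse one 'temperature, power' CSV row; non-numeric fields become None."""
    def to_float(field: str) -> Optional[float]:
        try:
            return float(field.strip())
        except ValueError:
            return None  # e.g. '[N/A]' when a sensor is not reported

    temp, power = line.split(",")[:2]
    return to_float(temp), to_float(power)


def is_hot(temp_c: Optional[float], warn_at_c: float = 83.0) -> bool:
    """True when a temperature reading is at or above the warning threshold."""
    return temp_c is not None and temp_c >= warn_at_c


def watch(interval_s: float = 5.0, warn_at_c: float = 83.0) -> None:
    """Poll the first GPU forever and print one timestamped line per sample."""
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        temp_c, power_w = parse_sample(out.splitlines()[0])
        flag = "  <-- at or above warning threshold" if is_hot(temp_c, warn_at_c) else ""
        print(f"{time.strftime('%H:%M:%S')} temp={temp_c}C power={power_w}W{flag}", flush=True)
        time.sleep(interval_s)
```

Calling `watch()` in a separate terminal while the inference workload runs gives a timestamped trace to compare against the moment of shutdown.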

Teason2026 · June 8, 2026, 4:28am

it does not overheat, it goes like this:

user@host:~$ nvidia-smi dmon -s p -d 1
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
    0     41     80      -
    0     41     80      -
    0     41     80      -
    0     45     81      -
    0     41     81      -
    0     42     81      -
    0     41     80      -
    0     41     81      -
    0     43     81      -
Read from remote host host: Operation timed out
Connection to host closed.
client_loop: send disconnect: Broken pipe

and it does not OOM.

I’ve run gpu-burn, stress-ng, and ib write, both separately and together; nothing reproduces this random shutdown the way vLLM does when running a model in TP mode on 2 Sparks.
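Because the SSH session dies together with the machine, the samples above only survive on the remote terminal. A sketch of one way to keep the last readings on the Spark itself is to run the same `nvidia-smi dmon -s p -d 1` command locally and force every line to disk as it arrives. The log path and function names are hypothetical, and the parser assumes the three-column layout shown in the post.

```python
import os
import subprocess
from typing import Optional


def parse_dmon_row(line: str) -> Optional[dict]:
    """Parse one data row of `nvidia-smi dmon -s p`; return None for headers and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        return {
            "gpu": int(fields[0]),
            "power_w": float(fields[1]),
            "gpu_temp_c": float(fields[2]),
        }
    except ValueError:
        return None  # e.g. '-' where a sensor is not reported


def append_durably(path: str, line: str) -> None:
    """Append one line and ask the OS to push it to storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())


def record(path: str = "dmon.log") -> None:
    """Run dmon once per second and persist every output line as it arrives."""
    cmd = ["nvidia-smi", "dmon", "-s", "p", "-d", "1"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            append_durably(path, line)
```

After the next boot, the last parseable row in the log (for the run in the post that would be 43 W at 81 °C) shows what the GPU reported in the final second or so before power was lost. The `fsync` per line costs some I/O but matters here, since a hard power-off can drop anything still sitting in the page cache.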

Teason2026 · June 8, 2026, 6:31pm

I ran the field diagnostic and the 2 failing units pass the test, yet they still shut down, and only under inference.

geeta.chauhan · June 8, 2026, 10:23pm

I was seeing hangs on a single Spark with large models due to OOM and network issues. I wrote a short guide based on what worked for me; hope it helps: Hardening Your DGX Spark for AI Workloads - Geeta
