DGX Spark shutting down under load - MODS-020000600139

My DGX Spark has been shutting down completely under sustained inference load (spark-vllm settings for qwen3.6-27b-fp8 with MTP). I’ve previously trained ColBERT models and smaller language models without any issues.

After several failures, I followed the field diagnostic instructions, and my system appears to have failed the Power Stress test with the code MODS-020000600139.

Notably, the Spark did not shut down during the test; it just failed the test.

Does anyone have experience with this? I believe I’ve already created a ticket in the right place. Is this just an RMA situation, or are there things I can try locally, e.g. with firmware? Not optimistic.

I had a similar issue where any NVFP4 model under load would randomly hang my cluster. Heatsinks and a fan fixed the issue. I’m not sure it’s actually a thermal issue, because I’ve seen temps go higher without a crash. Weird stuff, man.

I am having a similar issue. It is related to the prefill operations. Capping the clock at 2200 MHz solved my problems, with a 3-4% hit to completion times.
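The post above does not say how the 2200 MHz cap was applied. One common way to cap GPU clocks on NVIDIA hardware is `nvidia-smi -lgc MIN,MAX` (reset with `-rgc`), so here is a minimal sketch that only builds those commands. Whether the Spark's driver honors `-lgc` is an assumption, and the 300 MHz floor in the usage comment is a placeholder, not a value from this thread.

```python
import subprocess


def build_clock_cap_command(max_mhz: int, min_mhz: int, gpu_index: int = 0) -> list[str]:
    """Return the nvidia-smi argv that locks GPU clocks to [min_mhz, max_mhz]."""
    if min_mhz <= 0 or max_mhz < min_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    # -lgc takes a "min,max" pair in MHz; -i selects the GPU index.
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def build_clock_reset_command(gpu_index: int = 0) -> list[str]:
    """Return the nvidia-smi argv that removes a previously applied clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def apply(command: list[str]) -> int:
    """Run the command (usually needs root) and return its exit code."""
    return subprocess.run(command, check=False).returncode


# Hypothetical usage (300 is a placeholder minimum, not from the thread):
#   apply(build_clock_cap_command(max_mhz=2200, min_mhz=300))
#   apply(build_clock_reset_command())
```

The lock generally does not survive a reboot, so it would have to be reapplied at startup before the inference server launches.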

I have the same problem (I have not run the field diagnostic because the units are in a rack, and it is hard to disable Secure Boot without a KVM). Asus GX10, two units working as a pair over QSFP. They shut down without any logs under the first-ish load and the power LED goes off. The funny thing is that when I switch the unit back on by pressing the power button, it just works afterwards. Overnight I have automation that gracefully shuts down the server and cuts power at the wall, then brings it back in the morning by re-enabling power to the brick.

It was fine until the last firmware update in April. Now, every single time wall power is re-enabled, it shuts down under load and is completely fine on the next boot.

Officially I should be going through RMA, right? I guess I’m wondering whether it’s worth doing, given that other folks report similar issues.

Before doing an RMA, I strongly suggest monitoring the temperature of your GPU. The Spark is known to overheat by default under load. You can prevent the overheating by cooling the Spark with an external fan. I 3D printed an enclosure for my dual Sparks that mounts them upright and has a USB 120 mm fan blowing down on them. It keeps the temps in the low to mid 70s °C, with momentary spikes to the low 80s. Before adding the fan, mine would be at 83-85 °C about 5 minutes into an intensive workload. Adding the fan fixed the issue for me, and my Sparks run smoothly.
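Since the shutdowns reported in this thread leave no logs behind, one way to follow the monitoring advice is to write each sample to disk immediately, so the last reading survives a hard power-off. The sketch below polls `nvidia-smi --query-gpu` once per second and calls `fsync` after every line. The log path and interval are placeholders, and it assumes the three queried fields are reported on your unit (a field that comes back as `[N/A]` is stored as `None`).

```python
import os
import subprocess
import time

# Real nvidia-smi query fields: GPU temperature (C), power draw (W), SM clock (MHz).
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw,clocks.sm",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> dict:
    """Parse one CSV line such as '81, 43.10, 2200' into floats (None if not reported)."""
    def to_float(field: str):
        try:
            return float(field)
        except ValueError:  # e.g. "[N/A]"
            return None

    temp, power, clock = (to_float(f.strip()) for f in line.split(","))
    return {"temp_c": temp, "power_w": power, "sm_mhz": clock}


def log_forever(path: str, interval_s: float = 1.0) -> None:
    """Append one timestamped sample per interval and force it to disk each time."""
    with open(path, "a") as log:
        while True:
            out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
            first_gpu = out.strip().splitlines()[0]
            log.write(f"{time.time():.0f} {parse_sample(first_gpu)}\n")
            log.flush()
            os.fsync(log.fileno())  # so the final sample survives a sudden power loss
            time.sleep(interval_s)


# Hypothetical usage: log_forever("/var/tmp/spark_gpu.log")
```

After a shutdown, the last few lines of the log show whether temperature, power, or clocks were climbing right before the unit went dark.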

It does not overheat. It goes like this:

user@host:~$ nvidia-smi dmon -s p -d 1
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
    0     41     80      -
    0     41     80      -
    0     41     80      -
    0     45     81      -
    0     41     81      -
    0     42     81      -
    0     41     80      -
    0     41     81      -
    0     43     81      -
Read from remote host host: Operation timed out
Connection to host closed.
client_loop: send disconnect: Broken pipe

and it does not OOM.

I’ve run gpu-burn, stress-ng, and ib write, both separately and together, and nothing reproduces this random shutdown the way vLLM does when running a model in TP mode across 2 Sparks.

I ran the field diagnostic and both failing units pass the test, yet they still shut down, and only during inference.
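For anyone who wants to compare captures like the `nvidia-smi dmon -s p` transcript above, here is a small sketch that pulls the power and temperature columns out of a saved session and reports the peaks and the last sample before the connection dropped. The function names are made up for illustration, and the column layout (index, watts, GPU temperature) is taken from the transcript itself.

```python
def parse_dmon(text: str) -> list[tuple[int, int]]:
    """Return (power_w, gpu_temp_c) for each data row of `nvidia-smi dmon -s p` output."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like "0 41 80 -"; skip headers, blank lines and SSH noise.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        samples.append((int(fields[1]), int(fields[2])))
    return samples


def summarize(samples: list[tuple[int, int]]) -> dict:
    """Report sample count, peak power, peak temperature and the final sample."""
    return {
        "samples": len(samples),
        "max_power_w": max(p for p, _ in samples),
        "max_temp_c": max(t for _, t in samples),
        "last": samples[-1],
    }


# The rows from the transcript in this thread.
TRANSCRIPT = """\
# gpu    pwr  gtemp  mtemp
# Idx      W      C      C
    0     41     80      -
    0     41     80      -
    0     41     80      -
    0     45     81      -
    0     41     81      -
    0     42     81      -
    0     41     80      -
    0     41     81      -
    0     43     81      -
Read from remote host host: Operation timed out
"""

print(summarize(parse_dmon(TRANSCRIPT)))
# {'samples': 9, 'max_power_w': 45, 'max_temp_c': 81, 'last': (43, 81)}
```

In that capture the last sample before the drop is 43 W at 81 °C, and nothing in the final seconds stands out from the earlier rows.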

I was seeing hangs on a single Spark with large models due to OOM and network issues. I wrote a short guide based on what worked for me; hope it helps: Hardening Your DGX Spark for AI Workloads - Geeta