My DGX Spark has been completely shutting down under sustained inference tasks (spark-vllm settings for qwen3.6-27b-fp8 with MTP). I’ve previously trained ColBERT models and smaller language models on it without any issues.
After several failures, I followed the field diagnostic instructions, and my system appears to have failed the Power Stress test with the code: MODS-020000600139
Notably, the Spark did not shut down during the test; it just failed the test.
Does anyone have experience with this? I believe I’ve already created a ticket in the right place. Is this just an RMA situation, or are there things I can try locally re: firmware? Not optimistic.
I had a similar issue where any NVFP4 model under load would randomly hang my cluster. Heatsinks and a fan fixed it. Not sure if it’s actually a thermal issue, because I’ve seen temps go higher without a crash. Weird stuff, man.
I’ve got the same problem (I haven’t run the field diagnostic, as the unit is in a rack and it’s hard to disable Secure Boot without a KVM). Asus GX10, 2 units working as a pair over QSFP. They shut down without any logs under the first-ish load and the power LED goes off. The funny thing is that when I switch it back on by pressing the power button, it just works afterwards. Overnight I have automation that gracefully shuts down the server and cuts power at the wall, then brings it back in the morning by re-enabling power to the brick.
It was fine until the last firmware update in April. Now, every single time wall power is re-enabled, it shuts down under load, and it is completely fine on the next boot.
Before doing an RMA, I strongly suggest monitoring the temperature of your GPU; the Spark is known to overheat by default under load. You can prevent the overheating by cooling the Spark with an external fan. I 3D-printed an enclosure for my dual Sparks that mounts them upright and has a USB 120 mm fan blowing down on them. It keeps the temps in the low to mid 70s °C, with momentary spikes into the low 80s. Before adding the fan, mine would hit 83-85 °C about 5 minutes into an intensive workload. Adding the fan fixed the issue for me, and my Sparks now run smoothly.
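If it helps, here is a rough sketch of what that monitoring could look like: a small Python logger that polls `nvidia-smi` once a second and syncs every sample to disk, so the last reading survives a hard power-off. Assumptions on my part: `nvidia-smi` on your Spark reports `temperature.gpu` and `power.draw` (fields that come back as N/A are stored as `None`), the log file name is made up, and the 83 °C warning threshold is just the pre-fan figure quoted above, not an NVIDIA spec.

```python
import os
import subprocess
import time

# Assumed warning threshold, taken from the 83-85 C figure in this thread,
# not from any official specification.
WARN_C = 83.0

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_float(field):
    """Return the field as a float, or None for values like '[N/A]'."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_sample(line):
    """Parse one CSV line into (timestamp, temp_c, power_w)."""
    ts, temp, power = [f.strip() for f in line.split(",")]
    return ts, _to_float(temp), _to_float(power)


def is_hot(temp_c, warn_c=WARN_C):
    """True when a temperature reading exists and is at or above warn_c."""
    return temp_c is not None and temp_c >= warn_c


def monitor(path="spark_thermal.csv", interval_s=1.0):
    """Append one row per GPU per interval; runs until interrupted."""
    with open(path, "a") as log:
        while True:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True
            ).stdout
            for line in out.strip().splitlines():
                ts, temp, power = parse_sample(line)
                flag = "HOT" if is_hot(temp) else ""
                log.write(f"{ts},{temp},{power},{flag}\n")
            # Force the data to disk so it is still there after a sudden shutdown.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)


if __name__ == "__main__":
    monitor()
```

Start it before launching the workload and check the tail of the file after the next shutdown. If the last rows sit well below your usual peak temperatures, that points away from a purely thermal cause.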
I’ve run gpu-burn, stress-ng, and ib write tests, both separately and together, and nothing reproduces this random shutdown the way vLLM running a model in TP mode on 2 Sparks does.
I was seeing hangs on a single Spark with large models due to OOM and network issues. I wrote a short guide based on what worked for me; hope it helps: Hardening Your DGX Spark for AI Workloads - Geeta