I purchased a DGX Spark approximately 3 days ago (I bought it onsite at GTC on Friday, in the Build Your Claw tent) and I am experiencing severe and escalating instability. I have had 9 confirmed silent crashes today alone (March 23), and additional crashes on previous days. I believe I have a defective unit and am looking to initiate a warranty replacement.
Crash behavior:
Screen goes black with no recovery — long pressing the power button does not help
Full power disconnection from the wall is required to get the unit back
Crashes occur both under heavy GPU load (running Nemotron 3 Super via Ollama) AND at near-idle when no inference is actively running
No kernel panic, no graceful shutdown — logs end abruptly with no error trace (silent crash)
Reboot history from today (March 23) alone:
Mar 23 16:19 — still running (current)
Mar 23 15:52 — crashed after ~2 minutes
Mar 23 15:18 — crashed
Mar 23 14:54 — crashed
Mar 23 14:05 — crashed
Mar 23 13:38 — crashed
Mar 23 13:05 — crashed
Mar 23 12:33 — crashed
Mar 23 12:23 - 12:27 (3 min uptime)
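For anyone wanting to quantify the pattern, here is a minimal Python sketch that turns the boot times listed above into boot-to-boot gaps. The timestamps are copied from the list; the function name is hypothetical. Each gap includes the time the unit sat dead before being power-cycled, so it is only an upper bound on actual uptime (the 15:52 session, for example, crashed after about 2 minutes).

```python
from datetime import datetime

# Boot times copied from the reboot history above (March 23).
BOOT_TIMES = ["12:23", "12:33", "13:05", "13:38", "14:05",
              "14:54", "15:18", "15:52", "16:19"]

def boot_gaps_minutes(times):
    """Minutes between consecutive boots (upper bound on each session's uptime)."""
    parsed = [datetime.strptime(t, "%H:%M") for t in times]
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(parsed, parsed[1:])]

gaps = boot_gaps_minutes(BOOT_TIMES)
print(gaps)
print(f"{len(gaps)} gaps, mean {sum(gaps) / len(gaps):.1f} min")
```

None of the sessions in the list lasted longer than about 49 minutes between boots.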
Hardware errors appearing on every boot:
platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
The PCI DOE mailbox timeout is present on every single boot and appears to be a hardware-level communication failure.
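To check whether the same messages show up on another unit or another boot, a small filter like the following could be run over saved kernel log text. This is only a sketch: the patterns are taken from the lines quoted above, the function name is hypothetical, and the last line of the sample is a made-up unrelated message included to show that it is skipped.

```python
import re

# Patterns derived from the boot-time messages quoted above.
PATTERNS = [
    r"DOE: \[[0-9a-f]+\]",
    r"failed to claim resource",
    r"platform device creation failed",
]
MATCHER = re.compile("|".join(PATTERNS))

def hardware_error_lines(log_text):
    """Return the log lines that match any of the known error patterns."""
    return [line.strip() for line in log_text.splitlines() if MATCHER.search(line)]

SAMPLE = """\
platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""  # final line is a hypothetical unrelated message

hits = hardware_error_lines(SAMPLE)
for line in hits:
    print(line)
```

In practice the input would be the kernel log for one boot, for example the output of `journalctl -k -b` saved to a file and read in as text.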
Setup details:
Unit is in an air conditioned room with clear airflow on all sides
The unit is warm to the touch but not hot, and it has also crashed when only lukewarm
Using the original supplied power adapter
OS: Ubuntu with kernel 6.17.0-1008-nvidia
Ollama running as a background service (OpenClaw)
What I have tried:
Different power outlets — no improvement
Setting OLLAMA_KEEP_ALIVE=0 to unload models between sessions
Attempted to install nvidia-field-diag but hit a keyring conflict with the existing CUDA repo — happy to run diagnostics if NVIDIA support can advise the correct install method for the DGX Spark
I have full journalctl logs from the crashed sessions available to share. All crashes are confirmed silent — logs end abruptly with no error message.
Given that this unit is 3 days old and crashing 9+ times per day including at idle, I believe this is a defective unit and would like to proceed with a replacement.
Tagging @NVES as I have seen you assist with similar hardware cases on this forum.
Are you sure you’re not causing an OOM condition? That’s too big of a model to run on a single Spark. And why, for the love of god, do people continue to insist on running Ollama!
It was flashed from USB with the newest setup from NVIDIA at the GTC event. NVIDIA employees set it up for me, and they were the ones who directed me towards Ollama. I am a total newbie, so I just followed their advice. The flash was based on the latest version (so that’s 3 days ago), including OpenClaw pre-installed. The models that came with it were downloaded by the USB flash that NVIDIA used, and they included Nemotron 3 Super, Qwen 3.5 and another Qwen model whose name I don’t remember right now. The startup text file on the desktop recommended starting with Nemotron 3 Super. It’s a 120-billion-parameter model and should fit just fine according to what they said. While trying things out in the beginning I noticed it was quite slow to respond, so I switched to Qwen 3.5. I have used that repeatedly for days now, and it has crashed constantly with the Qwen model running. Just before the last crash (that’s part of the log file I let Claude analyze) I had loaded Nemotron, and that also crashed similarly to Qwen.
I edited the original, it was not 256B but 120B (sorry about that)
Got it, understood. There are good resources here to get the best out of your spark, so welcome!
Ruling out OOM issues, I will tell you there are numerous reports (including my own) of machines shutting down without warning under load. I have two machines and one of them will only run reliably when I manually reduce the maximum GPU clock. As a diagnostic step, try this:
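The exact command from this post is not preserved here. As a rough sketch of what capping the maximum GPU clock can look like, the snippet below builds an `nvidia-smi --lock-gpu-clocks` invocation. It assumes that option is supported by the driver on this unit, and the 200 and 2000 MHz values are purely illustrative placeholders, not the values from the post. The helper names are hypothetical.

```python
import subprocess

def lock_gpu_clocks_cmd(min_mhz, max_mhz):
    """Build the nvidia-smi command that restricts GPU clocks to [min_mhz, max_mhz]."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["sudo", "nvidia-smi", f"--lock-gpu-clocks={min_mhz},{max_mhz}"]

def reset_gpu_clocks_cmd():
    """Build the command that removes the clock restriction again."""
    return ["sudo", "nvidia-smi", "--reset-gpu-clocks"]

def run(cmd):
    """Execute a command built above; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)

# Illustrative values only (assumption): pick a cap below the default maximum.
cmd = lock_gpu_clocks_cmd(200, 2000)
print(" ".join(cmd))
# run(cmd)  # uncomment to actually apply the cap on the machine
```

The printed command can also be pasted straight into a terminal, and the reset command undoes the cap without a reboot.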
See if you can get it to crash after applying it. This won’t persist past a reboot unless you set it up in systemd, but it might tell you whether you are having the same problem as I and several others are.
Following up on my earlier request — this collects the same data plus EFI pstore crash records and rasdaemon hardware errors in a single run:
The verify step confirms the output is safe to share. If you’re able to run it, the .txt.gz (and .json if generated) would help narrow down where the failure is occurring.
Ollama has its place. It’s by far the quickest and easiest to get up and running for beginners, compared to llama.cpp and vllm. For beginners it’s not a bad place to start and the performance hit is far less than it used to be compared to the other two. One thing of note specific to the Spark, though, is that Ollama should be set to not launch on system startup because some of the vllm recipes for the larger models need every single bit of available RAM.