DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago

henriko · March 24, 2026, 12:06am

Hi,

I purchased a DGX Spark approximately 3 days ago (I bought it onsite at GTC on Friday, in the Build Your Claw tent) and I am experiencing severe and escalating instability. I have had 9 confirmed silent crashes today alone (March 23), and additional crashes on previous days. I believe I have a defective unit and am looking to initiate a warranty replacement.

Crash behavior:

Screen goes black with no recovery — long pressing the power button does not help
Full power disconnection from the wall is required to get the unit back
Crashes occur both under heavy GPU load (running Nemotron 3 Super via Ollama) AND at near-idle when no inference is actively running
No kernel panic, no graceful shutdown — logs end abruptly with no error trace (silent crash)

Reboot history from today (March 23) alone:

Mar 23 16:19 — still running (current)
Mar 23 15:52 — crashed after ~2 minutes
Mar 23 15:18 — crashed
Mar 23 14:54 — crashed
Mar 23 14:05 — crashed
Mar 23 13:38 — crashed
Mar 23 13:05 — crashed
Mar 23 12:33 — crashed
Mar 23 12:23 - 12:27 (3 min uptime)

Hardware errors appearing on every boot:

platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

The PCI DOE mailbox timeout is present on every single boot and appears to be a hardware-level communication failure.
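
To check that the DOE messages really do appear on every boot, the kernel log of each stored boot can be counted separately. This is a minimal sketch, assuming journald keeps logs across reboots; the function names are made up for illustration, and the fixed string "DOE: [" is simply taken from the lines quoted above.

```shell
# Count DOE mailbox messages in kernel-log text read from stdin.
count_doe_errors() {
    grep -cF 'DOE: ['
}

# Print the DOE message count for the current boot and the three before it.
# Boots that are not in the journal simply report 0.
doe_per_boot() {
    for offset in 0 -1 -2 -3; do
        printf 'boot %s: ' "$offset"
        journalctl -k -b "$offset" --no-pager 2>/dev/null | count_doe_errors
    done
}
```

Running doe_per_boot on a second, healthy unit gives a baseline: if the counts match, the messages are unlikely to be what distinguishes the crashing machine.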

Setup details:

Unit is in an air conditioned room with clear airflow on all sides
The unit is warm to the touch but not hot, and it has also crashed when only lukewarm
Using the original supplied power adapter
OS: Ubuntu with kernel 6.17.0-1008-nvidia
Ollama running as a background service (OpenClaw)

What I have tried:

Different power outlets — no improvement
Setting OLLAMA_KEEP_ALIVE=0 to unload models between sessions
Attempted to install nvidia-field-diag but hit a keyring conflict with the existing CUDA repo — happy to run diagnostics if NVIDIA support can advise the correct install method for the DGX Spark

I have full journalctl logs from the crashed sessions available to share. All crashes are confirmed silent — logs end abruptly with no error message.

Given that this unit is 3 days old and crashing 9+ times per day including at idle, I believe this is a defective unit and would like to proceed with a replacement.

Tagging @NVES as I have seen you assist with similar hardware cases on this forum.

Thank you

trystan1 · March 24, 2026, 12:31am

The dmesg output you posted is normal, I see it on all of mine that are happily humming along.

Start with a bios ‘load defaults’ and an image from scratch:

System Recovery — DGX Spark User Guide

That link has the latest recovery image that will automatically push firmware/low level patches on install.

josephbreda · March 24, 2026, 12:44am

Are you sure you’re not causing an OOM condition? That’s too big of a model to run on a single Spark. And why, for the love of god, do people continue to insist on running Ollama!
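
A quick way to look for OOM-killer activity in the boot that crashed is to search its kernel log. This is a rough sketch, assuming journald keeps logs across reboots; the function names are made up for illustration, and the patterns are the usual kernel OOM messages. A hard power-off can lose the last few log lines, so an empty result does not rule OOM out completely.

```shell
# Print kernel-log lines that suggest the OOM killer ran; reads log text on stdin.
find_oom_events() {
    grep -iE 'out of memory|oom-kill'
}

# Search the kernel log of the previous boot (the one that ended in a crash).
check_previous_boot_oom() {
    journalctl -k -b -1 --no-pager | find_oom_events
}
```

If check_previous_boot_oom prints nothing for several crashed boots in a row, memory pressure becomes a less likely explanation.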

henriko · March 24, 2026, 12:55am

It was flashed from USB with the newest setup from NVIDIA at the GTC event. NVIDIA employees set it up for me, and they were the ones who pointed me towards Ollama. I am a total newbie, so I just followed their advice. The flash was based on the latest version (so that’s 3 days ago), including OpenClaw pre-installed. The models that came with it were downloaded by the USB flash that NVIDIA used, and they included Nemotron 3 Super, Qwen 3.5, and another Qwen model whose name I don’t remember right now. The startup text file on the desktop recommended starting with Nemotron 3 Super. It’s a 120-billion-parameter model and should fit just fine according to what they said. Trying things out in the beginning, I noticed it was quite slow to respond, so I switched to Qwen 3.5. I have used that repeatedly for days now, and it has crashed constantly with the Qwen model running. Just before the last crash (that’s part of the log file I let Claude analyze) I had loaded Nemotron, and that also crashed similarly to Qwen.

I edited the original post: it was not 256B but 120B (sorry about that).

josephbreda · March 24, 2026, 1:52am

Got it, understood. There are good resources here to get the best out of your spark, so welcome!

Ruling out OOM issues, I will tell you there are numerous reports (including my own) of machines shutting down without warning under load. I have two machines and one of them will only run reliably when I manually reduce the maximum GPU clock. As a diagnostic step, try this:

sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 300,2100

See if you can get it to crash after applying it. This won’t persist past a reboot unless you set it up in systemd, but it might tell you whether you are having the same problem as me and several others.
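
For anyone who wants the cap to survive a reboot, a oneshot systemd unit is one way to do it. This is only a sketch under assumptions, not an NVIDIA-provided method: the unit name gpu-clock-cap.service is hypothetical, and /usr/bin/nvidia-smi is assumed to be where nvidia-smi lives (check with command -v nvidia-smi).

```shell
# Write a oneshot unit that reapplies the clock cap at boot.
# $1 is the directory to write into; it defaults to the system unit directory.
write_clock_cap_unit() {
    unit_dir="${1:-/etc/systemd/system}"
    cat > "$unit_dir/gpu-clock-cap.service" <<'EOF'
[Unit]
Description=Cap GPU clocks at boot (diagnostic workaround)

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 300,2100

[Install]
WantedBy=multi-user.target
EOF
}
```

After running write_clock_cap_unit as root, sudo systemctl daemon-reload followed by sudo systemctl enable --now gpu-clock-cap.service activates it. To undo the cap, disable the unit and run sudo nvidia-smi -rgc to reset the clocks.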

Let me know how it goes.

parallelArchitect · April 14, 2026, 7:09am

You mentioned having full journalctl logs from the crashed sessions. If you’re willing to share them, I can review them.

parallelArchitect · April 14, 2026, 12:48pm

Following up on my earlier request — this collects the same data plus EFI pstore crash records and rasdaemon hardware errors in a single run:

The verify step confirms the output is safe to share. If you’re able to run it, the .txt.gz (and .json if generated) would help narrow down where the failure is occurring.

NVES · April 14, 2026, 1:53pm

Hi Henriko, can you please run the NVIDIA DGX Spark Field Diagnostics, then DM me the resulting log bundle? We can decide on next steps pending those diags.

aostang · April 14, 2026, 2:43pm

Ollama has its place. It’s by far the quickest and easiest to get up and running for beginners, compared to llama.cpp and vllm. For beginners it’s not a bad place to start and the performance hit is far less than it used to be compared to the other two. One thing of note specific to the Spark, though, is that Ollama should be set to not launch on system startup because some of the vllm recipes for the larger models need every single bit of available RAM.

