DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago

Hi,

I purchased a DGX Spark approximately 3 days ago (I bought it onsite at GTC on Friday, in the Build Your Claw tent) and I am experiencing severe and escalating instability. I have had 9 confirmed silent crashes today alone (March 23), and additional crashes on previous days. I believe I have a defective unit and am looking to initiate a warranty replacement.

Crash behavior:

  • Screen goes black with no recovery — long pressing the power button does not help
  • Full power disconnection from the wall is required to get the unit back
  • Crashes occur both under heavy GPU load (running Nemotron 3 Super via Ollama) AND at near-idle when no inference is actively running
  • No kernel panic, no graceful shutdown — logs end abruptly with no error trace (silent crash)

Reboot history from today (March 23) alone:

Mar 23 16:19 — still running (current)
Mar 23 15:52 — crashed after ~2 minutes
Mar 23 15:18 — crashed
Mar 23 14:54 — crashed
Mar 23 14:05 — crashed
Mar 23 13:38 — crashed
Mar 23 13:05 — crashed
Mar 23 12:33 — crashed
Mar 23 12:23 - 12:27 (3 min uptime)

Hardware errors appearing on every boot:

platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

The PCI DOE mailbox timeout is present on every single boot and appears to be a hardware-level communication failure.
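
For anyone who wants to check whether the same DOE messages show up on every boot of their own unit, here is a minimal sketch. It assumes the journal is persistent (so earlier boots are still available), and the function name is only illustrative:

```shell
# Count DOE mailbox messages in the kernel log of one boot.
# Argument: boot offset as understood by journalctl (0 = current, -1 = previous).
# Assumption: persistent journald storage, otherwise only boot 0 is available.
doe_messages_for_boot() {
    journalctl -k -b "${1:-0}" --no-pager 2>/dev/null | grep -c 'DOE: \[' || true
}

# Example:
#   for b in 0 -1 -2; do echo "boot $b: $(doe_messages_for_boot "$b")"; done
```

A non-zero count on every boot only shows the messages are consistent; it does not by itself say whether they are harmful.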

Setup details:

  • Unit is in an air conditioned room with clear airflow on all sides
  • The case is warm to the touch but not hot; it has also crashed when only lukewarm
  • Using the original supplied power adapter
  • OS: Ubuntu with kernel 6.17.0-1008-nvidia
  • Ollama running as a background service (OpenClaw)

What I have tried:

  • Different power outlets — no improvement
  • Setting OLLAMA_KEEP_ALIVE=0 to unload models between sessions
  • Attempted to install nvidia-field-diag but hit a keyring conflict with the existing CUDA repo — happy to run diagnostics if NVIDIA support can advise the correct install method for the DGX Spark

I have full journalctl logs from the crashed sessions available to share. All crashes are confirmed silent — logs end abruptly with no error message.

Given that this unit is 3 days old and crashing 9+ times per day including at idle, I believe this is a defective unit and would like to proceed with a replacement.

Tagging @NVES as I have seen you assist with similar hardware cases on this forum.

Thank you

The dmesg output you posted is normal; I see it on all of mine that are happily humming along.

Start with a BIOS ‘load defaults’ and an image from scratch:

System Recovery — DGX Spark User Guide

That link has the latest recovery image that will automatically push firmware/low level patches on install.

Are you sure you are not causing an OOM condition? That’s too big of a model to run on a single Spark. And why, for the love of god, do people continue to insist on running Ollama!
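
One rough way to check the OOM theory is to search the kernel log of the crashed boot for OOM-killer messages. This is only a sketch: it assumes a persistent journal, the function name is made up for illustration, and a hard hang may cut the log off before anything gets written, so an empty result does not fully rule OOM out.

```shell
# Print OOM-killer related kernel messages for one boot.
# Argument: journalctl boot offset (defaults to -1, the previous boot).
# Assumption: the journal is persistent, so the previous boot is still on disk.
oom_events_for_boot() {
    journalctl -k -b "${1:--1}" --no-pager 2>/dev/null \
        | grep -iE 'out of memory|oom-killer|oom_reaper' || true
}

# Example:
#   oom_events_for_boot -1
```

If this prints lines naming ollama as the killed process, memory pressure is the more likely culprit; no output points away from a clean OOM kill.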

It was flashed from USB with the newest setup from NVIDIA at the GTC event. NVIDIA employees set it up for me, and they were the ones who directed me towards Ollama. I am a total newbie, so I just followed their advice. The flash was based on the latest version (so that’s 3 days ago), including OpenClaw pre-installed. The models that came with it were downloaded by the USB flash that NVIDIA used, and they included Nemotron 3 Super, Qwen 3.5 and another Qwen model whose name I don’t remember right now. The startup text file on the desktop recommended starting with Nemotron 3 Super. It’s a 120-billion-parameter model and should fit just fine according to what they said. While trying things out in the beginning, I noticed it was quite slow to respond, so I switched to Qwen 3.5. I have used that repeatedly for days now, and it has crashed constantly with the Qwen model running. At the time of the last crash (that’s part of the log file I let Claude analyze) I had loaded Nemotron, and that also crashed similarly to Qwen.

I edited the original, it was not 256B but 120B (sorry about that)

Got it, understood. There are good resources here to get the best out of your Spark, so welcome!

Ruling out OOM issues, I will tell you there are numerous reports (including my own) of machines shutting down without warning under load. I have two machines and one of them will only run reliably when I manually reduce the maximum GPU clock. As a diagnostic step, try this:

sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 300,2100

See if you can get it to crash after applying it. This won’t persist past a reboot unless you set it up in systemd, but it might tell you whether you are having the same problem as me and several others.
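
If the clock cap does help and you want it applied on every boot, a oneshot systemd unit is one way to do it. This is a sketch under assumptions: the unit name gpu-clock-cap.service is made up, and /usr/bin/nvidia-smi is the usual path but worth confirming with `command -v nvidia-smi` on your system.

```shell
# Print a oneshot unit that reapplies the two nvidia-smi commands at boot.
# Assumptions: nvidia-smi lives at /usr/bin/nvidia-smi and the driver is
# loaded by the time multi-user.target is reached.
print_clock_cap_unit() {
    cat <<'EOF'
[Unit]
Description=Cap GPU clocks (diagnostic workaround)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 300,2100

[Install]
WantedBy=multi-user.target
EOF
}

# To install (hypothetical unit name):
#   print_clock_cap_unit | sudo tee /etc/systemd/system/gpu-clock-cap.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now gpu-clock-cap.service
```

To undo it, disable the unit and run `sudo nvidia-smi -rgc` to reset the GPU clocks.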

Let me know how it goes.


You mentioned having full journalctl logs from the crashed sessions. If you’re willing to share them, I can review them.
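
If it helps, something like the following would dump one boot's journal into a compressed text file. It is only a sketch: the function name and default file name are made up, it assumes a persistent journal, and the output should be read through before sharing since it can contain hostnames, usernames and similar details.

```shell
# Export the full journal of one boot to a gzip-compressed text file.
# Arguments: journalctl boot offset (default -1) and output path.
# Assumption: persistent journald storage so earlier boots are retained.
export_boot_log() {
    boot_offset="${1:--1}"
    out_file="${2:-boot${boot_offset}.txt.gz}"
    journalctl -b "$boot_offset" --no-pager -o short-iso | gzip > "$out_file"
}

# Example: export the previous (crashed) boot
#   export_boot_log -1 crashed-boot.txt.gz
```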

Following up on my earlier request — this collects the same data plus EFI pstore crash records and rasdaemon hardware errors in a single run:

The verify step confirms the output is safe to share. If you’re able to run it, the .txt.gz (and .json if generated) would help narrow down where the failure is occurring.

Hi Henriko, can you please run the NVIDIA DGX Spark Field Diagnostics, then DM me the resulting log bundle? We can decide on next steps pending those diags.

Ollama has its place. It’s by far the quickest and easiest to get up and running for beginners, compared to llama.cpp and vLLM. For beginners it’s not a bad place to start, and the performance hit compared to the other two is far smaller than it used to be. One thing of note specific to the Spark, though, is that Ollama should be set not to launch on system startup, because some of the vLLM recipes for the larger models need every single bit of available RAM.
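
A minimal sketch of checking and turning off the autostart, assuming Ollama was installed as a systemd service named ollama (that is what the standard Linux installer creates; a preloaded image may use a different name, so treat the unit name as an assumption):

```shell
# Unit name is an assumption; override with OLLAMA_UNIT if yours differs.
ollama_unit="${OLLAMA_UNIT:-ollama}"

# Report whether the service is set to start at boot ("enabled", "disabled", ...).
ollama_autostart_state() {
    systemctl is-enabled "$ollama_unit" 2>/dev/null || true
}

# To stop it launching at boot, and stop it right now:
#   sudo systemctl disable --now "$ollama_unit"
# To start it by hand when you want it:
#   sudo systemctl start "$ollama_unit"
```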