Title: [Root Cause Analysis] DOE mailbox timeout + persistent driver failure on DGX Spark — kernel 6.17.0-1008-nvidia has documented aarch64 kernel panics (EFI pstore evidence)
Disclosure: I don’t own a DGX Spark. Everything in this post is based on logs, crash records, and sosreports shared by community members. I’m contributing forensic analysis from the outside because understanding what’s happening under the hood benefits everyone — users, production engineers, and NVIDIA alike. The DGX Spark is a powerful machine. The more we understand its failure modes from primary evidence, the better tools and workflows we can build around it.
I have been doing forensic analysis of a DGX Spark failure case contributed by a community member who shared their full sosreport and field diagnostic logs. The system has been failing on every boot since February 13, 2026 — 31 days, 40+ distinct boot entries documented in the journalctl unit log, all showing Failed to query NVIDIA devices within seconds of boot. No resolution.
I believe the root cause affects any DGX Spark running kernel 6.17.0-1008-nvidia, which shipped as part of the DGX OS 7.4.0 OTA update around February 12-13, 2026.
This analysis is also directly relevant to @henriko’s thread: DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago
In that thread, @trystan1 stated the DOE mailbox errors are “normal.” Based on primary source evidence, they are not normal on a healthy cold boot — they are the consequence of a prior kernel panic leaving the Grace firmware in a corrupted state.
Hardware (confirmed from sosreport DMI data)
- DGX Spark, GB10 SM12.1
- Running BIOS: 5.36_0ACUM018 (August 2025)
- Installed BIOS per firmware inventory: 5.36_0ACUM023 (December 2025)
- Driver 580.126.09, CUDA 13.0, kernel 6.17.0-1008-nvidia
- SK Hynix LPDDR5 128GB @ 8533 MT/s
Root cause — kernel upgrade on February 13
Boot symlink timestamps from the sosreport confirm the failure began exactly when 6.17.0-1008-nvidia became the default boot kernel:
Feb 13 09:46 — vmlinuz → vmlinuz-6.17.0-1008-nvidia (symlink updated)
Feb 13 09:50 — nvidia-persistenced: Failed to query NVIDIA devices
Four minutes separate the kernel switch from the first driver failure. The nvidia-persistenced journal shows the last healthy boot on December 29 under 6.14.0-1013-nvidia, with NUMA memory onlined and the device registered successfully. Every boot from February 13 onward ran 6.17.0-1008-nvidia and failed the same way, with Failed to query NVIDIA devices logged within seconds of boot.
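For anyone who wants to reproduce the boot count on their own unit, here is a minimal sketch. It assumes the journal was exported with something like `journalctl -u nvidia-persistenced -o json` (one JSON object per line, with the standard `_BOOT_ID` and `MESSAGE` fields). The function name and the sample records are mine, for illustration only; they are not taken from the sosreport.

```python
import json
from collections import Counter

FAILURE_TEXT = "Failed to query NVIDIA devices"

def failing_boots(json_lines):
    """Count failure messages per boot ID in journalctl JSON-lines output."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        # journalctl encodes non-UTF-8 messages as byte arrays; skip those.
        if isinstance(message, str) and FAILURE_TEXT in message:
            counts[entry.get("_BOOT_ID", "unknown")] += 1
    return dict(counts)

# Hypothetical sample records for illustration only.
sample = [
    '{"_BOOT_ID": "boot-a", "MESSAGE": "Failed to query NVIDIA devices"}',
    '{"_BOOT_ID": "boot-a", "MESSAGE": "unrelated message"}',
    '{"_BOOT_ID": "boot-b", "MESSAGE": "Failed to query NVIDIA devices"}',
]
per_boot = failing_boots(sample)
print(f"{len(per_boot)} boots with driver failure: {per_boot}")
```

The number of keys in the result is the number of distinct boots that logged the failure, which is the figure to compare against the 40+ reported here.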
The previous kernel 6.14.0-1013-nvidia is still installed on this system and has never been tried since the switch. It is bootable from the grub menu right now.
This kernel was pushed as part of the DGX OS 7.4.0 OTA update. Community member mmos confirmed receiving it via apt upgrade on February 12, in the thread “Ubuntu 26.04 LTS (Kernel: 6.17.0) ARM64 on DGX Spark, anyone?”
The DGX OS 7.4.0 release notes confirm 6.17.0-1008-nvidia as the official DGX Spark kernel for that release: New DGX OS 7.4.0
Any DGX Spark that updated in that window received this kernel as the new default.
Three kernel bugs identified from EFI pstore crash records
The sosreport contains EFI pstore crash records from multiple boots. These are crash logs written to non-volatile storage before the kernel died — they survive across reboots and tell us exactly what happened before the DOE failure state was established.
Bug 1 — nbcon console stack overflow (idle CPU, ~2.5 hours uptime)
[9351.883486] pc : nbcon_get_cpu_emergency_nesting+0x10/0x80
[9351.883491] lr : nbcon_get_default_prio+0x2c/0x60
[9351.883496] sp : ffff8000801d8000 ← exact stack bottom
[9351.883470] Insufficient stack space to handle exception!
[9351.883476] FAR: 0xffff8000801d7ff0 ← 16 bytes below stack
[9351.883476] Task stack: [0xffff8000801d8000..0xffff8000801dc000]
[9351.883507] Kernel panic - not syncing: kernel stack overflow
CPU 7, PID 0 (swapper/7 — idle task). No call trace recoverable — the stack was already exhausted at the time of the overflow. The nbcon non-blocking console subsystem recursed into itself during interrupt handling on an idle CPU until the stack hit the guard page.
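The address arithmetic behind the two annotations in that excerpt can be checked directly. This is a small sketch that uses only the three values quoted in the pstore record; the variable and function names are mine.

```python
import re

# Values copied from the pstore excerpt above.
TASK_STACK = "Task stack: [0xffff8000801d8000..0xffff8000801dc000]"
FAR_LINE = "FAR: 0xffff8000801d7ff0"
SP_LINE = "sp : ffff8000801d8000"

def parse_stack_range(line):
    """Return (low, high) addresses from a 'Task stack: [0x..0x]' line."""
    match = re.search(r"\[0x([0-9a-f]+)\.\.0x([0-9a-f]+)\]", line)
    if match is None:
        raise ValueError(f"no stack range in: {line!r}")
    return int(match.group(1), 16), int(match.group(2), 16)

stack_low, stack_high = parse_stack_range(TASK_STACK)
far = int(FAR_LINE.split(":")[1], 16)
sp = int(SP_LINE.split(":")[1], 16)

stack_size = stack_high - stack_low   # size of the task stack in bytes
bytes_below = stack_low - far         # how far the faulting access was below it

print(f"stack size: {stack_size} bytes")
print(f"sp at stack bottom: {sp == stack_low}")
print(f"fault address is {bytes_below} bytes below the stack")
```

The stack pointer sits exactly on the lowest valid address, and the faulting access lands just under it, which is consistent with the next push hitting the guard page.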
Bug 2 — FPAC/PSCI/NMI race condition (396 seconds uptime)
Full call trace recovered from pstore:
[396.369993] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[396.387506] CPU: 7 PID: 0 Comm: swapper/7
[396.401830] ct_nmi_enter+0x90/0xf8 (P) ← recursive ×4
[396.410682] ct_kernel_enter.isra.0+0xb8/0xe0 (P)
[396.411643] psci_cpu_suspend_enter+0xb0/0x118
[396.412258] acpi_idle_lpi_enter+0xbc/0xd0
[396.412778] cpuidle_enter_state+0x98/0x720
[396.414278] do_idle+0x108/0x120
[396.416271] Kernel panic - not syncing: Attempted to kill the idle task!
CPU 7 was entering an ACPI LPI C-state via PSCI when an NMI fired. The NMI context tracking function ct_nmi_enter called itself recursively — the PAC-tagged link register lr: 0xca00a00936b5309c failed ARM Pointer Authentication, triggering an FPAC fault. This is a race condition between PSCI idle entry and NMI delivery on aarch64 with Pointer Authentication enabled. The dmesg confirms: PSCIv1.1 detected in firmware, SMC Calling Convention v1.5.
Bug 3 — qspinlock IOVA hash overflow during NVMe writeback (777 seconds uptime)
[777.095250] UBSAN: array-index-out-of-bounds in qspinlock.h:68:9
[777.096789] index 11548 is out of range for type 'long unsigned int [512]'
[777.097664] Workqueue: writeback wb_workfn (flush-259:0)
[777.097665] Call trace:
queued_spin_lock_slowpath+0x488/0x4b0
_raw_spin_lock_irqsave
alloc_iova_fast
iommu_dma_alloc_iova
nvme_prep_rq [nvme]
nvme_queue_rqs [nvme]
wb_workfn
[777.400426] Tainted: [D]=DIE, [O]=OOT_MODULE
[781.104235] Kernel panic - not syncing: SBSA Generic Watchdog timeout
The IOMMU DMA IOVA spinlock hash table was indexed at 11548 against a 512-entry array. Escalated to [D]=DIE taint. The SBSA hardware watchdog fired 4 seconds later because CPU 7 could not be stopped. The kernel command line includes iommu.passthrough=0 — IOMMU translation is active on this system, making this path trigger during NVMe writeback operations.
Taint source in all panics: mstflint_access(O) — the Mellanox firmware access tool. The NVIDIA driver does not appear in any call trace. These are kernel-level bugs in 6.17.0-1008-nvidia, not NVIDIA driver bugs.
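To pull the same summary out of your own pstore records, a rough sketch follows. It assumes you have the record contents as text (for example, read from files under /sys/fs/pstore after a crash) and that panic lines follow the usual `[timestamp] Kernel panic - not syncing: reason` layout seen above. The function name is mine.

```python
import re

PANIC_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+Kernel panic - not syncing: (.+)")

def find_panics(text):
    """Return (uptime_seconds, reason) for every panic line in the text."""
    return [(float(m.group(1)), m.group(2).strip())
            for m in PANIC_RE.finditer(text)]

# The three panic lines quoted in this post.
records = """\
[9351.883507] Kernel panic - not syncing: kernel stack overflow
[396.416271] Kernel panic - not syncing: Attempted to kill the idle task!
[781.104235] Kernel panic - not syncing: SBSA Generic Watchdog timeout
"""

panics = find_panics(records)
for uptime, reason in panics:
    print(f"{uptime / 60:7.1f} min uptime: {reason}")
```

Three different panic reasons at three different uptimes is what makes me treat these as separate bugs rather than one failure seen three times.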
How kernel panics cause the DOE mailbox failure
Each kernel panic causes the Grace CPU firmware (MediaTek MTKID) to log a BERT (Boot Error Record Table) hardware error. After a BERT-triggered platform reset, the next boot shows this at kernel second 0 during PCIe enumeration:
platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
acpi NVDA8800:00: platform device creation failed: -16
platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
acpi NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link
NVDA8800 and NVDA8900 are the NVLink-C2C management platform devices — they control the coherent interconnect between the Grace CPU and the GB10 GPU. The firmware leaves their memory regions marked EBUSY after the BERT reset. The C2C initialization chain fails first; the DOE mailbox failure is a consequence of that, not an independent event.
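The trailing numbers in those lines are negated Linux errno values. A quick way to decode them, as a sketch: it assumes the script runs on Linux, where Python's errno table matches the kernel's numbering, and the helper name is mine.

```python
import errno
import os
import re

# Matches a trailing ": -N" error code at the end of a log line.
ERR_RE = re.compile(r":\s*-(\d+)\s*$")

def decode_kernel_errno(line):
    """Return (symbol, description) for a trailing ': -N' code, else None."""
    match = ERR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

lines = [
    "acpi NVDA8800:00: platform device creation failed: -16",
    "pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5",
    "pci 000f:01:00.0: DOE: [2c8] ABORT timed out",
]
for line in lines:
    print(decode_kernel_errno(line), "<-", line)
```

The platform devices fail with a "resource busy" code while the DOE mailbox fails with an I/O error, which fits the ordering described above: the memory regions are already claimed, and the mailbox then cannot be reached.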
Confirmed from lspci on the failed system at capture time:
DOESta: Busy+ IntSta+ Error+ ObjectReady-
LnkSta: Speed unknown, Width x0
Control: BusMaster-
Zero PCIe bandwidth. No NVIDIA kernel modules load. The system is fully alive — SSH works, all non-GPU services run, memory pressure is zero. Only the GPU driver stack is dead.
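If you want to check your own unit against this signature, the lspci flag lines are easy to parse. A sketch, with one loud assumption: treating "Busy and Error set, ObjectReady clear" as the stuck condition is my reading of the capture above, not a definition from any specification. Both function names are mine.

```python
def parse_flags(line):
    """Split 'Name: Flag+ Flag-' into (name, {flag: bool})."""
    name, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return name.strip(), flags

def doe_mailbox_stuck(flags):
    """Assumed stuck signature: Busy and Error set, ObjectReady clear."""
    return (flags.get("Busy", False)
            and flags.get("Error", False)
            and not flags.get("ObjectReady", False))

# Lines copied from the lspci capture above.
_, doe_flags = parse_flags("DOESta: Busy+ IntSta+ Error+ ObjectReady-")
_, control_flags = parse_flags("Control: BusMaster-")

print("DOE status:", doe_flags)
print("stuck signature present:", doe_mailbox_stuck(doe_flags))
print("bus mastering enabled:", control_flags["BusMaster"])
```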
Warm reboots do not recover this state. Two consecutive warm reboots are documented in the logs, both showing the identical DOE failure, so whatever reset a warm reboot issues on this system does not retrain the PCIe link or clear the mailbox. The stuck DOE mailbox is firmware-level hardware state that clears only when power is removed at the wall.
Field diagnostic result does not rule out this failure
The NVIDIA field diagnostic tool (dgx-spark-fieldiag r9.257.3) uses a proprietary mods kernel driver (v4.31) that bypasses the normal production driver initialization chain. It does not go through the DOE mailbox or the ACPI NVDA8800/NVDA8900 platform device setup.
One week after the driver had already been failing on every boot, all 8 field diagnostic tests passed clean on this system: GpuStress (199 seconds), C2CStress (6.9 seconds), PowerStress (489 seconds), ThermalStress, MemStress, FioSSD, CpuStress1, CpuStress2. Hardware confirmed fully healthy.
A clean field diagnostic result does not mean the system is healthy for the production driver stack. The mods diagnostic driver accesses GPU hardware through a completely different initialization path that is not affected by DOE mailbox state. This is an important distinction for anyone troubleshooting persistent driver failure.
BIOS version mismatch
DMI firmware inventory from the affected system:
- Running BIOS: 5.36_0ACUM018 (August 6, 2025)
- Installed BIOS per DMI firmware inventory: 5.36_0ACUM023 (December 22, 2025)
Kernel 6.17.0-1008-nvidia was built January 21, 2026, after the December 2025 BIOS release, yet the system is still running the August 2025 BIOS. The PSCI/NMI interaction in Bug 2 may be sensitive to this version difference, specifically in how PSCIv1.1 handles CPU C-state entry and NMI delivery on the older firmware.
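The timeline can be laid out as simple date arithmetic. This sketch uses only the three dates quoted in this post; the variable names are mine.

```python
from datetime import date

running_bios = date(2025, 8, 6)      # 5.36_0ACUM018, the BIOS actually running
installed_bios = date(2025, 12, 22)  # 5.36_0ACUM023, per DMI firmware inventory
kernel_build = date(2026, 1, 21)     # 6.17.0-1008-nvidia build date

gap_running = (kernel_build - running_bios).days
gap_installed = (kernel_build - installed_bios).days
bios_lags_kernel = running_bios < installed_bios < kernel_build

print(f"kernel built {gap_running} days after the running BIOS")
print(f"kernel built {gap_installed} days after the installed BIOS")
print(f"running BIOS older than both: {bios_lags_kernel}")
```

The point is only that the kernel postdates both BIOS builds while the system runs the older one. Whether that gap matters for Bug 2 is still an open question.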
Regarding @henriko’s thread
@henriko reported identical DOE mailbox errors on a brand new unit purchased at GTC, also running 6.17.0-1008-nvidia: DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago
platform NVDA8800:00: failed to claim resource 0
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
These are identical to what appears in the logs analyzed above after a kernel panic cascade. On a healthy system after a clean cold boot these errors should not appear — NVDA8800/NVDA8900 initialize cleanly and the DOE mailbox creates successfully. When they appear on every boot it means the firmware was left in a corrupted state. Both systems are running 6.17.0-1008-nvidia.
Immediate recovery path
For any DGX Spark in this failure state with 6.14.0-1013-nvidia still installed:
- If DOE stuck state is present: cold power cycle first (shut down OS, disconnect from wall, wait 60 seconds, reconnect). Warm reboots will not help.
- Reboot and hold Shift or Esc to access the grub menu.
- Select “Advanced options for Ubuntu.”
- Select 6.14.0-1013-nvidia.
- After successful boot, do not run apt upgrade until a fixed kernel is available.
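Before and after following these steps, it helps to confirm which kernel is running and whether the rollback kernel is still on disk. A minimal sketch, assuming kernel images live in /boot under the usual `vmlinuz-<release>` naming; the function names and the sample listing are hypothetical.

```python
AFFECTED = "6.17.0-1008-nvidia"
ROLLBACK = "6.14.0-1013-nvidia"

def installed_kernels(boot_entries):
    """Extract release strings from names such as 'vmlinuz-6.14.0-1013-nvidia'."""
    prefix = "vmlinuz-"
    return sorted(name[len(prefix):] for name in boot_entries
                  if name.startswith(prefix))

def rollback_status(running_release, boot_entries):
    """Report whether the affected kernel is running and a rollback exists."""
    kernels = installed_kernels(boot_entries)
    return {
        "running_affected": running_release == AFFECTED,
        "rollback_installed": ROLLBACK in kernels,
    }

# Hypothetical /boot listing for illustration only.
sample_boot = [
    "vmlinuz",
    "vmlinuz-6.14.0-1013-nvidia",
    "vmlinuz-6.17.0-1008-nvidia",
    "initrd.img-6.17.0-1008-nvidia",
]
status = rollback_status("6.17.0-1008-nvidia", sample_boot)
print(status)
```

On a live system the inputs would be `platform.release()` and `os.listdir("/boot")`. After a successful rollback, `running_affected` should read False.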
Questions for NVIDIA / Ubuntu kernel team
- Is the PSCI idle + NMI + ct_nmi_enter recursion race condition on aarch64 known in 6.17.0-1008-nvidia? Is there a patch or workaround in progress?
- Is the qspinlock IOVA hash table index overflow in alloc_iova_fast during NVMe writeback known?
- Does 6.17.0-1008-nvidia have known incompatibilities with BIOS 5.36_0ACUM018?
- Is there a mechanism to prevent the DOE mailbox stuck state from persisting across warm reboots after a BERT reset?
- Should 6.17.0-1008-nvidia be pulled from the OTA channel pending a patch, or is 6.14.0-1013-nvidia a safe rollback target for affected systems?
About this analysis
I don’t have a DGX Spark. These findings are entirely from primary source evidence — EFI pstore crash records, journalctl boot history, dmesg, lspci, dmidecode, and field diagnostic logs shared by a DGX community member. No speculation.
There are a lot of brilliant minds in this community. Together — users, production engineers, and contributors working from logs and crash reports — we can get full capabilities out of the DGX Spark and build better diagnostic workflows for everyone. Having hardware access makes hands-on contribution possible at a different depth, and that only makes the analysis more useful to the community as a whole.
Two open-source tools relevant to DGX Spark diagnostics:
spark-gpu-throttle-check — originally built by @hoesing to address the 513MHz GPU clock collapse issue documented here: Investigating 513MHz cap for GPU
I forked it and added GB10-specific clock validation and load adequacy gates. All credit to the original work. The tool helps confirm whether a DGX Spark is in a healthy power state before attributing problems to software: parallelArchitect/spark-gpu-throttle-check on GitHub (enhanced GPU throttle diagnostic for DGX Spark (GB10): NVML direct telemetry, throttle cause decoder, PCIe link monitoring, baseline drift detection, timeline capture).
cuda-unified-memory-analyzer: UMA memory pressure diagnostics for GB10 unified memory architecture and discrete GPU platforms. Built to measure what’s actually happening in the unified memory pool under workload: parallelArchitect/cuda-unified-memory-analyzer on GitHub (NVIDIA GPU Unified Memory diagnostic tool: architecture-aware, measurement-based, PCIe/coherent transport detection).
parallelArchitect