[Root Cause Analysis] DGX Spark driver failure — kernel 6.17.0-1008-nvidia aarch64 panics cause DOE mailbox failure (pstore evidence)



Disclosure: I don’t own a DGX Spark. Everything in this post is based on logs, crash records, and sosreports shared by community members. I’m contributing forensic analysis from the outside because understanding what’s happening under the hood benefits everyone — users, production engineers, and NVIDIA alike. The DGX Spark is a powerful machine. The more we understand its failure modes from primary evidence, the better tools and workflows we can build around it.


I have been doing forensic analysis of a DGX Spark failure case contributed by a community member who shared their full sosreport and field diagnostic logs. The system has failed on every boot since February 13, 2026: over 31 days, 40+ distinct boot entries in the journalctl unit log all show “Failed to query NVIDIA devices” within seconds of boot. There has been no resolution.

I believe the root cause affects any DGX Spark running kernel 6.17.0-1008-nvidia, which shipped as part of the DGX OS 7.4.0 OTA update around February 12-13, 2026.

This analysis is also directly relevant to @henriko’s thread: DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago

In that thread, @trystan1 stated the DOE mailbox errors are “normal.” Based on primary source evidence, they are not normal on a healthy cold boot — they are the consequence of a prior kernel panic leaving the Grace firmware in a corrupted state.


Hardware (confirmed from sosreport DMI data)

  • DGX Spark, GB10 SM12.1

  • Running BIOS: 5.36_0ACUM018 (August 2025)

  • Installed BIOS per firmware inventory: 5.36_0ACUM023 (December 2025)

  • Driver 580.126.09, CUDA 13.0, kernel 6.17.0-1008-nvidia

  • SK Hynix LPDDR5 128GB @ 8533 MT/s


Root cause — kernel upgrade on February 13

Boot symlink timestamps from the sosreport confirm the failure began exactly when 6.17.0-1008-nvidia became the default boot kernel:

Feb 13 09:46 — vmlinuz → vmlinuz-6.17.0-1008-nvidia  (symlink updated)
Feb 13 09:50 — nvidia-persistenced: Failed to query NVIDIA devices

Four minutes separate the kernel switch from the first driver failure. The nvidia-persistenced journal confirms the last healthy boot was December 29 on 6.14.0-1013-nvidia, with NUMA memory onlined and the device registered successfully. Every boot from February 13 onward was on 6.17.0-1008-nvidia and failed within seconds with the same “Failed to query NVIDIA devices” message.
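
For anyone who wants to reproduce the boot count on their own unit, here is a minimal sketch. It assumes the journal was exported with journalctl -u nvidia-persistenced -o json (one JSON object per line) and relies on the standard _BOOT_ID and MESSAGE journal fields. The function names are hypothetical and are not part of any NVIDIA tooling.

```python
import json
from collections import OrderedDict

# Failure text quoted from the nvidia-persistenced journal in this post.
FAILURE_TEXT = "Failed to query NVIDIA devices"


def failed_boots(json_lines):
    """Map each boot ID (first-seen order) to True if the failure text appeared."""
    boots = OrderedDict()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        boot_id = entry.get("_BOOT_ID")
        if boot_id is None:
            continue
        message = entry.get("MESSAGE")
        # journalctl emits non-string MESSAGE values for binary payloads; ignore them.
        failed = isinstance(message, str) and FAILURE_TEXT in message
        boots[boot_id] = boots.get(boot_id, False) or failed
    return boots


def summarize(json_lines):
    """Return (boots seen in the export, boots containing the failure text)."""
    boots = failed_boots(json_lines)
    return len(boots), sum(1 for bad in boots.values() if bad)
```

Passing sys.stdin (or an open file) to summarize gives the two counts. Boots in which the unit logged nothing at all do not appear in the export, so the first number can undercount total boots.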

The previous kernel 6.14.0-1013-nvidia is still installed on this system and has never been tried since the switch. It is bootable from the grub menu right now.

This kernel was pushed as part of the DGX OS 7.4.0 OTA update. Community member mmos confirmed receiving it via apt upgrade on February 12: Ubuntu 26.04 LTS (Kernel: 6.17.0) ARM64 on DGX Spark, anyone?

The DGX OS 7.4.0 release notes confirm 6.17.0-1008-nvidia as the official DGX Spark kernel for that release: New DGX OS 7.4.0

Any DGX Spark that updated in that window received this kernel as the new default.


Three kernel bugs identified from EFI pstore crash records

The sosreport contains EFI pstore crash records from multiple boots. These are crash logs written to non-volatile storage before the kernel died — they survive across reboots and tell us exactly what happened before the DOE failure state was established.
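
As a rough illustration of how such records can be triaged, the sketch below pulls the panic reason and kernel timestamp out of pstore dmesg records. The default directory is an assumption: live records are normally exposed under /sys/fs/pstore, systems running systemd-pstore may have archived them under /var/lib/systemd/pstore, and inside a sosreport the path will differ. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as:
# [9351.883507] Kernel panic - not syncing: kernel stack overflow
PANIC_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*Kernel panic - not syncing: (.+)")


def panic_reasons(text):
    """Return (uptime_seconds, reason) for every panic line found in text."""
    return [(float(m.group(1)), m.group(2).strip()) for m in PANIC_RE.finditer(text)]


def scan_pstore(directory="/sys/fs/pstore"):
    """Scan dmesg records below directory; return {relative_path: [(uptime, reason), ...]}."""
    root = Path(directory)
    results = {}
    for path in sorted(root.rglob("dmesg*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable record (permissions, races); skip it
        found = panic_reasons(text)
        if found:
            results[str(path.relative_to(root))] = found
    return results
```

Reading /sys/fs/pstore usually requires root. The uptime value is what lets a panic at 396 seconds be told apart from one at 9351 seconds without reading each record by hand.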

Bug 1 — nbcon console stack overflow (idle CPU, ~2.5 hours uptime)

[9351.883486] pc : nbcon_get_cpu_emergency_nesting+0x10/0x80
[9351.883491] lr : nbcon_get_default_prio+0x2c/0x60
[9351.883496] sp : ffff8000801d8000   ← exact stack bottom
[9351.883470] Insufficient stack space to handle exception!
[9351.883476] FAR: 0xffff8000801d7ff0  ← 16 bytes below stack
[9351.883476] Task stack: [0xffff8000801d8000..0xffff8000801dc000]
[9351.883507] Kernel panic - not syncing: kernel stack overflow

CPU 7, PID 0 (swapper/7, the idle task). No call trace is recoverable because the stack was already exhausted when the exception was taken. The pc and lr values place the fault in the nbcon non-blocking console subsystem, which recursed during interrupt handling on an idle CPU until the stack reached the guard page.
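
The address arithmetic behind the “exact stack bottom” and “16 bytes below stack” annotations can be checked directly. This small sketch uses only the values printed in the crash record above; the function name is hypothetical.

```python
def stack_overflow_facts(stack_low, stack_high, sp, far):
    """Derive the basic overflow facts from the addresses in the crash record."""
    return {
        "stack_size_bytes": stack_high - stack_low,    # size of the task stack
        "sp_at_stack_bottom": sp == stack_low,         # stack grows down, so this means no room left
        "fault_bytes_below_stack": stack_low - far,    # how far past the stack the access landed
    }


facts = stack_overflow_facts(
    stack_low=0xFFFF8000801D8000,   # Task stack lower bound
    stack_high=0xFFFF8000801DC000,  # Task stack upper bound
    sp=0xFFFF8000801D8000,          # sp from the register dump
    far=0xFFFF8000801D7FF0,         # FAR (faulting address)
)
print(facts)
```

The result is a 16384-byte (16 KiB) stack with sp sitting at its lowest address and the faulting access 16 bytes below it, which matches the annotations in the log excerpt.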

Bug 2 — FPAC/PSCI/NMI race condition (396 seconds uptime)

Full call trace recovered from pstore:

[396.369993] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[396.387506] CPU: 7 PID: 0 Comm: swapper/7
[396.401830]  ct_nmi_enter+0x90/0xf8 (P)   ← recursive ×4
[396.410682]  ct_kernel_enter.isra.0+0xb8/0xe0 (P)
[396.411643]  psci_cpu_suspend_enter+0xb0/0x118
[396.412258]  acpi_idle_lpi_enter+0xbc/0xd0
[396.412778]  cpuidle_enter_state+0x98/0x720
[396.414278]  do_idle+0x108/0x120
[396.416271] Kernel panic - not syncing: Attempted to kill the idle task!

CPU 7 was entering an ACPI LPI C-state via PSCI when an NMI fired. The NMI context tracking function ct_nmi_enter called itself recursively — the PAC-tagged link register lr: 0xca00a00936b5309c failed ARM Pointer Authentication, triggering an FPAC fault. This is a race condition between PSCI idle entry and NMI delivery on aarch64 with Pointer Authentication enabled. The dmesg confirms: PSCIv1.1 detected in firmware, SMC Calling Convention v1.5.

Bug 3 — qspinlock IOVA hash overflow during NVMe writeback (777 seconds uptime)

[777.095250] UBSAN: array-index-out-of-bounds in qspinlock.h:68:9
[777.096789] index 11548 is out of range for type 'long unsigned int [512]'
[777.097664] Workqueue: writeback wb_workfn (flush-259:0)
[777.097665] Call trace:
              queued_spin_lock_slowpath+0x488/0x4b0
              _raw_spin_lock_irqsave
              alloc_iova_fast
              iommu_dma_alloc_iova
              nvme_prep_rq [nvme]
              nvme_queue_rqs [nvme]
              wb_workfn
[777.400426] Tainted: [D]=DIE, [O]=OOT_MODULE
[781.104235] Kernel panic - not syncing: SBSA Generic Watchdog timeout

The qspinlock slow path, entered from the IOMMU DMA IOVA allocator (alloc_iova_fast), indexed a 512-entry array at 11548. The fault escalated to a [D]=DIE taint, and the SBSA hardware watchdog fired 4 seconds later because CPU 7 could not be stopped. The kernel command line includes iommu.passthrough=0, so IOMMU translation is active on this system and this allocation path runs during NVMe writeback.

Taint source in all panics: mstflint_access(O) — the Mellanox firmware access tool. The NVIDIA driver does not appear in any call trace. These are kernel-level bugs in 6.17.0-1008-nvidia, not NVIDIA driver bugs.


How kernel panics cause the DOE mailbox failure

Each kernel panic causes the Grace CPU firmware (MediaTek MTKID) to log a BERT (Boot Error Record Table) hardware error. After a BERT-triggered platform reset, the next boot shows this at kernel second 0 during PCIe enumeration:

platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
acpi NVDA8800:00: platform device creation failed: -16
platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
acpi NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command: -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link

NVDA8800 and NVDA8900 are the NVLink-C2C management platform devices — they control the coherent interconnect between the Grace CPU and the GB10 GPU. The firmware leaves their memory regions marked EBUSY after the BERT reset. The C2C initialization chain fails first; the DOE mailbox failure is a consequence of that, not an independent event.
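
To check another unit's dmesg for the same pattern, a minimal sketch follows. The regular expressions are built only from the log lines quoted in this post. The rule that a C2C platform-device failure plus a DOE failure together indicate the stuck state is this analysis's working assumption, not an NVIDIA-documented criterion, and the function names are hypothetical.

```python
import re

# Patterns derived from the dmesg lines quoted in this post.
SIGNATURES = {
    "c2c_resource_claim": re.compile(r"platform NVDA8[89]00:00: failed to claim resource"),
    "c2c_device_creation": re.compile(r"NVDA8[89]00:00: platform device creation failed"),
    "doe_abort_timeout": re.compile(r"DOE: \[[0-9a-f]+\] ABORT timed out"),
    "doe_mailbox_create": re.compile(r"DOE: \[[0-9a-f]+\] failed to create mailbox"),
    "pcie_zero_bandwidth": re.compile(r"0\.000 Gb/s available PCIe bandwidth"),
}


def match_signatures(dmesg_text):
    """Return {signature_name: [matching lines]} for the given dmesg text."""
    hits = {name: [] for name in SIGNATURES}
    for line in dmesg_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


def looks_like_doe_stuck_state(dmesg_text):
    """True when both a C2C platform-device failure and a DOE failure are present."""
    hits = match_signatures(dmesg_text)
    c2c = hits["c2c_resource_claim"] or hits["c2c_device_creation"]
    doe = hits["doe_abort_timeout"] or hits["doe_mailbox_create"]
    return bool(c2c and doe)
```

Requiring both groups keeps a lone DOE warning from being classified as the full failure chain described above.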

Confirmed from lspci on the failed system at capture time:

DOESta: Busy+ IntSta+ Error+ ObjectReady-
LnkSta: Speed unknown, Width x0
Control: BusMaster-

Zero PCIe bandwidth. No NVIDIA kernel modules load. The system is fully alive — SSH works, all non-GPU services run, memory pressure is zero. Only the GPU driver stack is dead.
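
The same three lspci fields can be extracted programmatically. The sketch below assumes the input is lspci -vv text for a single device (the GPU at 000f:01:00.0) and that the field labels match the ones quoted above. The function names and the exact stuck-state condition are illustrative assumptions.

```python
import re

# Flags in lspci output look like "Busy+" (set) or "ObjectReady-" (clear).
FLAG_RE = re.compile(r"([A-Za-z0-9]+)([+-])(?=\s|$)")
# Tolerates annotations such as "(downgraded)" after the speed value.
LINK_RE = re.compile(r"Speed\s+(\S+)[^,]*,\s*Width\s+x(\d+)")


def parse_flags(line):
    """Turn 'Label: A+ B-' into {'A': True, 'B': False}."""
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in FLAG_RE.findall(rest)}


def gpu_link_summary(lspci_text):
    """Collect DOE status flags, link speed/width, and BusMaster for one device."""
    summary = {"doe": {}, "link_speed": None, "link_width": None, "bus_master": None}
    for raw in lspci_text.splitlines():
        line = raw.strip()
        if line.startswith("DOESta:"):
            summary["doe"] = parse_flags(line)
        elif line.startswith("LnkSta:"):
            match = LINK_RE.search(line)
            if match:
                summary["link_speed"] = match.group(1)
                summary["link_width"] = int(match.group(2))
        elif line.startswith("Control:"):
            summary["bus_master"] = parse_flags(line).get("BusMaster")
    return summary


def doe_stuck(summary):
    """Busy and Error set with ObjectReady clear, as in the capture quoted above."""
    doe = summary["doe"]
    return bool(doe.get("Busy") and doe.get("Error") and not doe.get("ObjectReady", False))
```

A healthy unit would be expected to report a non-zero link width and BusMaster+ once the driver has loaded, so the summary doubles as a quick before/after comparison around a cold power cycle.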

Warm reboots do not recover this state. Two consecutive warm reboots are documented in the logs, both showing identical DOE failure, so the link did not retrain to a working state on either one. The stuck DOE mailbox appears to be firmware-level hardware state that clears only when power is removed at the wall.


Field diagnostic result does not rule out this failure

The NVIDIA field diagnostic tool (dgx-spark-fieldiag r9.257.3) uses a proprietary mods kernel driver (v4.31) that bypasses the normal production driver initialization chain. It does not go through the DOE mailbox or the ACPI NVDA8800/NVDA8900 platform device setup.

One week after the driver had already been failing on every boot, all 8 field diagnostic tests passed clean on this system: GpuStress (199 seconds), C2CStress (6.9 seconds), PowerStress (489 seconds), ThermalStress, MemStress, FioSSD, CpuStress1, CpuStress2. Hardware confirmed fully healthy.

A clean field diagnostic result does not mean the system is healthy for the production driver stack. The mods diagnostic driver accesses GPU hardware through a completely different initialization path that is not affected by DOE mailbox state. This is an important distinction for anyone troubleshooting persistent driver failure.


BIOS version mismatch

DMI firmware inventory from the affected system:

  • Running BIOS: 5.36_0ACUM018 (August 6, 2025)

  • Installed BIOS per DMI firmware inventory: 5.36_0ACUM023 (December 22, 2025)

Kernel 6.17.0-1008-nvidia was built January 21, 2026, after the December 2025 BIOS release, yet the system is still running the August 2025 BIOS. The PSCI/NMI interaction in Bug 2 may be sensitive to this version difference, specifically in how PSCIv1.1 handles CPU C-state entry and NMI delivery on this firmware version.


Regarding @henriko’s thread

@henriko reported identical DOE mailbox errors on a brand new unit purchased at GTC, also running 6.17.0-1008-nvidia: DGX Spark — 9+ silent crashes in one day, PCI DOE mailbox timeout on every boot, unit purchased 3 days ago

platform NVDA8800:00: failed to claim resource 0
platform NVDA8900:00: platform device creation failed: -16
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

These are identical to what appears in the logs analyzed above after a kernel panic cascade. On a healthy system after a clean cold boot these errors should not appear — NVDA8800/NVDA8900 initialize cleanly and the DOE mailbox creates successfully. When they appear on every boot it means the firmware was left in a corrupted state. Both systems are running 6.17.0-1008-nvidia.


Immediate recovery path

For any DGX Spark in this failure state with 6.14.0-1013-nvidia still installed:

  1. If DOE stuck state is present: cold power cycle first — shut down OS, disconnect from wall, wait 60 seconds, reconnect. Warm reboots will not help.

  2. Reboot and hold Shift or Esc to access the grub menu.

  3. Select “Advanced options for Ubuntu.”

  4. Select 6.14.0-1013-nvidia.

  5. After successful boot, do not run apt upgrade until a fixed kernel is available.
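
Before rebooting, it is worth confirming that the rollback kernel is actually present. This is a minimal sketch that assumes kernel images follow the usual Ubuntu naming of /boot/vmlinuz-<release>. The two version strings are the ones discussed in this post, and the function names are hypothetical.

```python
import os
import platform

ROLLBACK = "6.14.0-1013-nvidia"  # last known-good kernel in this analysis
AFFECTED = "6.17.0-1008-nvidia"  # kernel under suspicion in this analysis


def installed_kernels(boot_entries):
    """Extract kernel releases from file names such as 'vmlinuz-6.14.0-1013-nvidia'."""
    prefix = "vmlinuz-"
    return sorted(name[len(prefix):] for name in boot_entries if name.startswith(prefix))


def rollback_status(running, boot_entries, rollback=ROLLBACK, affected=AFFECTED):
    """Summarize whether the running kernel is the affected one and a rollback exists."""
    kernels = installed_kernels(boot_entries)
    return {
        "running": running,
        "running_affected": running == affected,
        "rollback_installed": rollback in kernels,
        "installed": kernels,
    }


def check_this_machine(boot_dir="/boot"):
    """Run the check against the local system."""
    return rollback_status(platform.release(), os.listdir(boot_dir))
```

Calling check_this_machine() on the unit and printing the result shows at a glance whether step 4 is possible. If rollback_installed is False, the grub menu will not offer 6.14.0-1013-nvidia.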


Questions for NVIDIA / Ubuntu kernel team

  1. Is the PSCI idle + NMI + ct_nmi_enter recursion race condition on aarch64 known in 6.17.0-1008-nvidia? Is there a patch or workaround in progress?

  2. Is the qspinlock IOVA hash table index overflow in alloc_iova_fast during NVMe writeback known?

  3. Does 6.17.0-1008-nvidia have known incompatibilities with BIOS 5.36_0ACUM018?

  4. Is there a mechanism to prevent the DOE mailbox stuck state from persisting across warm reboots after a BERT reset?

  5. Should 6.17.0-1008-nvidia be pulled from the OTA channel pending a patch, or is 6.14.0-1013-nvidia a safe rollback target for affected systems?


About this analysis

I don’t have a DGX Spark. These findings are entirely from primary source evidence — EFI pstore crash records, journalctl boot history, dmesg, lspci, dmidecode, and field diagnostic logs shared by a DGX community member. No speculation.

There are a lot of brilliant minds in this community. Together — users, production engineers, and contributors working from logs and crash reports — we can get full capabilities out of the DGX Spark and build better diagnostic workflows for everyone. Having hardware access makes hands-on contribution possible at a different depth, and that only makes the analysis more useful to the community as a whole.

Two open-source tools relevant to DGX Spark diagnostics:

spark-gpu-throttle-check — originally built by @hoesing to address the 513MHz GPU clock collapse issue documented here: Investigating 513MHz cap for GPU

I forked it and added GB10-specific clock validation and load adequacy gates. All credit to the original work. The tool helps confirm whether a DGX Spark is in a healthy power state before attributing problems to software. GitHub: parallelArchitect/spark-gpu-throttle-check. Enhanced GPU throttle diagnostic for DGX Spark (GB10): NVML direct telemetry, throttle cause decoder, PCIe link monitoring, baseline drift detection, timeline capture.

cuda-unified-memory-analyzer: UMA memory pressure diagnostics for GB10 unified memory architecture and discrete GPU platforms. Built to measure what’s actually happening in the unified memory pool under workload. GitHub: parallelArchitect/cuda-unified-memory-analyzer. NVIDIA GPU Unified Memory diagnostic tool: architecture-aware, measurement-based, PCIe/coherent transport detection.

parallelArchitect


Update — Firmware update timeline and lspci confirmation from sosreport

Deeper analysis of the sosreport reveals additional primary source evidence.

Firmware update timeline — fwupd history

Two firmware updates were applied on February 18, 2026 — five days after the kernel panics began:

Embedded Controller: 0x02004b03 → 0x02004e12  (urgency: High, lvfs)
SoC Firmware:        0x02009009 → 0x02009418  (urgency: High, lvfs)

Both updates were applied after the kernel panic cascade had already been established — and the system continued failing through March 16 when the sosreport was taken. The EC and SoC firmware updates did not resolve the DOE mailbox failure. The kernel 6.17.0-1008-nvidia remains the primary trigger.

This is consistent with the firmware lag documented above. At the time the crashes began the Embedded Controller was at 0x02004b03. The update to 0x02004e12 arrived five days later and did not restore normal boot behavior.

GPU PCIe state confirmed from lspci at sosreport capture — March 16

000f:01:00.0 VGA compatible controller — NVIDIA GB10
LnkSta:  Speed 2.5GT/s, Width x1 (downgraded)
Control: BusMaster-
DevSta:  CorrErr+, UnsupReq+
MSI:     Enable- Address: 0000000000000000
CommClk-

The GPU link had degraded to PCIe Gen1 x1 — downgraded from the expected operating speed. BusMaster disabled — the GPU cannot initiate DMA. MSI interrupt vector uninitialized — no interrupts possible. This confirms the driver was completely dead at sosreport capture time, consistent with the DOE mailbox stuck state documented in the original post.

nvidia-persistenced confirmed failed at exactly 13:13:14 on March 16 — same Failed to query NVIDIA devices signature as every boot since February 13.

PD firmware observation

No PD firmware entry appears in fwupd on this unit (spark-c03d, NVIDIA FE SKU). @eugr (Spark Expert) observed PD firmware FW1/FW2 slot versions in fwupd on other affected units in the thermal shutdown thread: DGXSPARK temperature too high, automatic shutdown.

Spark3 (new GTC unit):     PD Firmware FW1: 5.7, FW2: 4.10
Spark1/Spark2 (Oct-Nov):   PD Firmware FW1: 5.7, FW2: 5.7

Whether the absence here reflects a SKU difference or a consequence of the DOE failure state preventing fwupd enumeration is not determined from this data alone.

Update — April 14, 2026

The zombie/hang behavior reported in PyTorch issue #174358, “[Memory] Unbounded allocation on NVIDIA DGX (Unified Memory) causes system hang instead of OOM”, is resolved on the following stack:

  • Kernel 6.17
  • NVIDIA driver 580.142
  • DGX OS 7.4.0

Units still experiencing hangs should verify they are running this stack before further diagnostics:

Additional finding — incorrect memory reporting in monitoring tools on GB10 systems

Across multiple community reports, monitoring tools use MemTotal (~121 GB) as the denominator instead of allocatable memory.

On coherent UMA platforms, MemTotal does not represent usable capacity. MemAvailable is a closer approximation, as it reflects kernel reservations and page cache state.

This produces misleading utilization readings. In one observed case, ~92.7% memory usage was reported with GPU at 0% and CPU ~0.2%, inconsistent with real memory pressure.

This is a tool-side reporting issue, not a hardware limitation.
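
To make the difference concrete, the sketch below computes two utilization figures from /proc/meminfo. The exact formula each monitoring tool uses is not known here. The “naive” figure assumes a tool counts everything outside MemFree as used, and the second figure uses MemAvailable as suggested above. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def utilization(meminfo):
    """Compare a MemFree-based reading with a MemAvailable-based one (percent of MemTotal)."""
    total = meminfo["MemTotal"]
    return {
        # Assumed naive reading: page cache and reclaimable memory count as "used".
        "used_pct_from_memfree": 100.0 * (total - meminfo["MemFree"]) / total,
        # Reading based on what the kernel estimates can still be allocated.
        "used_pct_from_memavailable": 100.0 * (total - meminfo["MemAvailable"]) / total,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the local meminfo file."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

A large gap between the two percentages on an idle system points to page cache or reclaimable memory inflating the reported usage rather than real memory pressure.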

Related work is in progress upstream:

Field observations: