DGX Spark - Persistent 30-Minute Restart After EC and SoC Firmware Updates


I am writing to request urgent RMA authorization for my DGX Spark system due to what appears to be a persistent hardware defect causing automatic restarts approximately every 30 minutes, even after applying the EC and SoC firmware updates (the USB-C PD firmware update could not be downloaded).


SYSTEM INFORMATION


Product:           NVIDIA DGX Spark
GPU:               NVIDIA GB10 (UUID: GPU-341e6151-e3db-e319-260e-64e5ff978401)
BIOS Version:      5.36_0ACUM018 (Release Date: 08/06/2025)
DGX_SWBUILD:       7.2.3 (Build Date: 2025-09-10)
DGX_OTA_VERSION:   7.4.0 (OTA Date: Wed Mar 18 21:53:17 MST 2026)
Case Reference:    [Previous Support Case Number if available]


ISSUE DESCRIPTION

My DGX Spark performs a hard restart approximately every 30 minutes regardless of system load. This occurs:
- At complete idle (no GPU workload)
- With normal GPU temperatures (40-42°C)
- With normal power draw (5-8W)
- Without any kernel panic or error messages in logs
- As an abrupt power loss (no graceful shutdown process)
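
The restart interval can be measured from the journal instead of estimated. The following is a minimal sketch, assuming journald keeps logs across reboots (persistent storage); the function name boot_gaps is hypothetical. It takes boot start times as epoch seconds, one per line in ascending order, and prints the gap between consecutive boots in minutes.

```shell
#!/usr/bin/env bash
# boot_gaps: read epoch timestamps (one per line, ascending) on stdin and
# print the gap between each consecutive pair in minutes.
boot_gaps() {
  awk 'NR > 1 { printf "%.1f\n", ($1 - prev) / 60 } { prev = $1 }'
}

# Example usage (assumption: the last four boots are still in the journal).
# The first journal line of each boot approximates the boot start time;
# "-o short-unix" prints it as epoch seconds with a fractional part.
#
#   for i in -3 -2 -1 0; do
#     journalctl -b "$i" -o short-unix | head -n 1 | cut -d. -f1
#   done | boot_gaps
```

Gaps that cluster tightly around 30 minutes would support a timer-driven cause rather than a load-driven or thermal one.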


TROUBLESHOOTING COMPLETED


I have completed extensive troubleshooting, including the recommended firmware updates (one of which failed to download; see item 3C):

1. HARDWARE ERROR CHECKS - ALL CLEAR
   
   $ sudo ras-mc-ctl --errors
   No Memory errors.
   No PCIe AER errors.
   No Extlog errors.
   No MCE errors.

2. THERMAL STATUS - NORMAL  
   GPU Temperature: 40-42°C (well below 95°C thermal limit)
   GPU Power Draw: 5-8W at idle
   No thermal throttling detected

3. FIRMWARE UPDATES APPLIED
   
   
   A) Embedded Controller (EC) Firmware - SUCCESS
      Updated: 0x00000001 → 0x02004e12 (Latest)
      Description: "improves performance and stability of Embedded Controller"
      Result: Update successful, system restarted as expected
      
   B) SoC Firmware (UEFI + GPU) - SUCCESS
      Updated: 0x00000001 → 0x0200941a (Latest, NVIDIA-tested 2026-03-02)
      Description: "improves performance and stability including UEFI and GPU"
      Result: Update successful, system restarted as expected
      
   C) USB-C Power Delivery Firmware - FAILED
      Attempted: 0x00000001 → 0x00000507
      Error: "failed to download file: Could not resolve host: r2.fwupd.org"
      Result: Still at factory default 0x00000001

4. INITIAL FIRMWARE BUG IDENTIFIED AND RESOLVED

   BEFORE updates: ACPI "[Firmware Bug]: No valid trip points!" on all 7 thermal zones
   AFTER SoC update: Thermal zones now properly configured, bug resolved
   
   However, 30-minute restart PERSISTS despite firmware fixes.
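
The thermal-zone fix in item 4 can be rechecked from sysfs. This is a minimal sketch, assuming the kernel exposes zones in the standard Linux layout (/sys/class/thermal/thermal_zone*/trip_point_N_temp); the function name zones_without_trips is hypothetical. It lists every thermal zone that has no trip point files, so empty output means all zones have at least one trip point.

```shell
#!/usr/bin/env bash
# zones_without_trips [BASE_DIR]
# Print the name of each thermal zone under BASE_DIR that exposes no
# trip_point_*_temp file. BASE_DIR defaults to /sys/class/thermal.
zones_without_trips() {
  local base=${1:-/sys/class/thermal}
  local zone
  for zone in "$base"/thermal_zone*; do
    [[ -d "$zone" ]] || continue
    # compgen -G succeeds only if the glob matches at least one path.
    if ! compgen -G "$zone/trip_point_*_temp" > /dev/null; then
      basename "$zone"
    fi
  done
}

# Example usage:
#   zones_without_trips
#   sudo dmesg | grep -i "Firmware Bug"
```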


CURRENT FIRMWARE STATUS

Component          | Before Update | After Update | Status
-------------------|---------------|--------------|------------------
EC Firmware        | 0x00000001    | 0x02004e12   | Latest
SoC/UEFI Firmware  | 0x00000001    | 0x0200941a   | Latest  
USB-C PD Firmware  | 0x00000001    | 0x00000001   | Failed/Blocked
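
The Failed/Blocked status for the USB-C PD firmware comes from a name-resolution error ("Could not resolve host: r2.fwupd.org"), so the network path is worth checking before a retry. This is a minimal sketch, assuming the updates were applied with fwupdmgr (the fwupd.org host name suggests this, but it is not stated above); can_resolve and retry_pd_update are hypothetical helper names.

```shell
#!/usr/bin/env bash
# can_resolve HOST
# Succeed (exit 0) if the system resolver returns an address for HOST.
can_resolve() {
  getent hosts "$1" > /dev/null
}

# retry_pd_update [HOST]
# Hypothetical helper: refresh metadata and retry the update only when the
# download host resolves. HOST defaults to the host from the error message.
retry_pd_update() {
  local host=${1:-r2.fwupd.org}
  if can_resolve "$host"; then
    fwupdmgr refresh --force && fwupdmgr update
  else
    echo "cannot resolve $host; fix DNS before retrying" >&2
    return 1
  fi
}

# Example usage:
#   retry_pd_update
```

If the host resolves and the update still fails, that result is more useful to support than the original DNS error.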

ROOT CAUSE ANALYSIS

Given that:
1. All thermal metrics are normal (not overheating)
2. All hardware error checks pass (no RAM/PCIe defects)
3. EC firmware updated (watchdog/power management should be fixed)
4. SoC firmware updated (ACPI/thermal management fixed)
5. USB-C PD firmware could not be updated (the download failed with a DNS resolution error, so this component remains at its factory version)
6. 30-minute restart PERSISTS after the EC and SoC firmware updates

CONCLUSION: Probable hardware defect in the power management subsystem (PMIC),
thermal management hardware, or mainboard, requiring physical replacement.
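
Points 1 and 6 could be backed with data by sampling temperature and power continuously up to the moment of a restart. This is a minimal sketch, assuming nvidia-smi on this system reports temperature.gpu and power.draw; the function name log_samples and the output file name are hypothetical. Each sample is flushed to disk so the final reading survives an abrupt power loss.

```shell
#!/usr/bin/env bash
# log_samples OUTFILE COUNT INTERVAL CMD...
# Run CMD COUNT times, INTERVAL seconds apart, appending its output to
# OUTFILE and calling sync after each sample so data reaches the disk.
log_samples() {
  local out=$1 count=$2 interval=$3
  shift 3
  local i
  for ((i = 0; i < count; i++)); do
    "$@" >> "$out"
    sync
    if (( i + 1 < count )); then
      sleep "$interval"
    fi
  done
}

# Example usage: 720 samples at 5-second intervals covers 60 minutes,
# which spans at least one restart cycle.
#
#   log_samples "$HOME/gpu-trace.csv" 720 5 \
#     nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw \
#                --format=csv,noheader
```

The last line written before each restart shows the temperature and power draw within a few seconds of the power loss.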


COMPARISON TO KNOWN ISSUES

This matches documented cases in NVIDIA Developer Forums:
- Case 1: DGX spark keeps rebooting every 20-30 minutes → Required RMA
- Case 2: Is This a Hardware Issue Requiring Repeated Shutdowns → NVIDIA 
  confirmed hardware defect, RMA authorized
- Multiple forum users with identical 30-minute restart pattern required 
  hardware replacement


REQUEST


1. IMMEDIATE RMA AUTHORIZATION for my DGX Spark

2. PRIORITY HANDLING as system is completely unusable for any workload



sounds like you’re watchdogging

I’d rule that out first before you talk to the manager.

Sounds like you’re hitting the hardware watchdog. Make sure you didn’t disable the software ‘kick’ by accident.
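
A rough way to check that, as a sketch only: this assumes the kernel exposes watchdog attributes in sysfs under /sys/class/watchdog (it depends on how the kernel was built), and describe_watchdog is just a name I made up. A timeout near 1800 seconds with nothing kicking the device would line up with a 30-minute reset.

```shell
#!/usr/bin/env bash
# describe_watchdog [DIR]
# Print the readable identity/state/timeout/timeleft attributes of one
# watchdog device. DIR defaults to /sys/class/watchdog/watchdog0.
describe_watchdog() {
  local dir=${1:-/sys/class/watchdog/watchdog0}
  local attr
  for attr in identity state timeout timeleft; do
    if [[ -r "$dir/$attr" ]]; then
      printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
    fi
  done
}

# Example usage:
#   describe_watchdog
#   systemctl show -p RuntimeWatchdogUSec   # is systemd set up to kick it?
```

If state shows active and RuntimeWatchdogUSec is 0, something armed the watchdog and nothing is feeding it.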