I am writing to request urgent RMA authorization for my DGX Spark system due to a persistent hardware defect causing automatic restarts every 30 minutes, even after applying all available firmware updates.
SYSTEM INFORMATION
Product: NVIDIA DGX Spark
GPU: NVIDIA GB10 (UUID: GPU-341e6151-e3db-e319-260e-64e5ff978401)
BIOS Version: 5.36_0ACUM018 (Release Date: 08/06/2025)
DGX_SWBUILD: 7.2.3 (Build Date: 2025-09-10)
DGX_OTA_VERSION: 7.4.0 (OTA Date: Wed Mar 18 21:53:17 MST 2026)
Case Reference: [Previous Support Case Number if available]
ISSUE DESCRIPTION
My DGX Spark performs a hard restart approximately every 30 minutes, regardless of system load. This occurs:
- At complete idle (no GPU workload)
- With normal GPU temperatures (40-42°C)
- With normal power draw (5-8W)
- Without any kernel panic or error messages in logs
- As an abrupt power loss (no graceful shutdown process)
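To back this pattern with numbers, the uptime of each boot can be computed from the first and last log timestamps of that boot (for example, copied by hand from `journalctl --list-boots`). The following is a minimal Python sketch, not an official diagnostic. The function names and the sample timestamps are illustrative assumptions and are not taken from the affected system.

```python
from datetime import datetime
from statistics import mean, pstdev


def uptime_minutes(boots):
    """Return the uptime of each boot in minutes.

    `boots` is a list of (first_entry, last_entry) ISO-8601 strings,
    one pair per boot.
    """
    minutes = []
    for first, last in boots:
        start = datetime.fromisoformat(first)
        end = datetime.fromisoformat(last)
        minutes.append((end - start).total_seconds() / 60.0)
    return minutes


def summarize(minutes):
    """Summarize uptimes: a small spread points to a timer-like cause."""
    return {
        "count": len(minutes),
        "mean": mean(minutes),
        "spread": pstdev(minutes),  # population standard deviation
    }


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    sample = [
        ("2026-03-18T10:00:00", "2026-03-18T10:30:00"),
        ("2026-03-18T10:32:00", "2026-03-18T11:03:00"),
        ("2026-03-18T11:05:00", "2026-03-18T11:34:00"),
    ]
    print(summarize(uptime_minutes(sample)))
```

A nearly constant uptime across boots suggests something timer-driven, while widely varying uptimes would fit an intermittent fault better. Either result is useful evidence to attach to the request.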
TROUBLESHOOTING COMPLETED
I have completed extensive troubleshooting including all recommended firmware updates:
1. HARDWARE ERROR CHECKS - ALL CLEAR
$ sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
2. THERMAL STATUS - NORMAL
GPU Temperature: 40-42°C (well below 95°C thermal limit)
GPU Power Draw: 5-8W at idle
No thermal throttling detected
3. FIRMWARE UPDATES APPLIED
A) Embedded Controller (EC) Firmware - SUCCESS
Updated: 0x00000001 → 0x02004e12 (Latest)
Description: "improves performance and stability of Embedded Controller"
Result: Update successful, system restarted as expected
B) SoC Firmware (UEFI + GPU) - SUCCESS
Updated: 0x00000001 → 0x0200941a (Latest, NVIDIA-tested 2026-03-02)
Description: "improves performance and stability including UEFI and GPU"
Result: Update successful, system restarted as expected
C) USB-C Power Delivery Firmware - FAILED
Attempted: 0x00000001 → 0x00000507
Error: "failed to download file: Could not resolve host: r2.fwupd.org"
Result: Still at factory default 0x00000001
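The error in C) is a DNS resolution failure, so it is worth confirming whether the system can resolve the firmware host at all before treating the failed update as a device problem. A minimal Python sketch, assuming only the host name quoted in the error message; the function name `can_resolve` and the injectable `resolver` parameter are illustrative choices:

```python
import socket

FWUPD_HOST = "r2.fwupd.org"  # host name taken from the error message above


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo and can be swapped out
    for testing without network access.
    """
    try:
        return len(resolver(host, 443)) > 0
    except socket.gaierror:
        # Raised when name resolution fails.
        return False
```

Calling `can_resolve(FWUPD_HOST)` on the DGX Spark shows whether name resolution works. If it returns False, the DNS or network configuration is the first thing to check; once it returns True, the update can be retried, for example with `fwupdmgr refresh` followed by `fwupdmgr update`.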
4. INITIAL FIRMWARE BUG IDENTIFIED AND RESOLVED
BEFORE updates: ACPI "[Firmware Bug]: No valid trip points!" on all 7 thermal zones
AFTER SoC update: Thermal zones now properly configured, bug resolved
However, the 30-minute restart PERSISTS despite the firmware fixes.
CURRENT FIRMWARE STATUS
Component         | Before Update | After Update | Status
------------------|---------------|--------------|---------------
EC Firmware       | 0x00000001    | 0x02004e12   | Latest
SoC/UEFI Firmware | 0x00000001    | 0x0200941a   | Latest
USB-C PD Firmware | 0x00000001    | 0x00000001   | Failed/Blocked
ROOT CAUSE ANALYSIS
Given that:
1. All thermal metrics are normal (not overheating)
2. All hardware error checks pass (no RAM/PCIe defects)
3. EC firmware updated (watchdog/power management should be fixed)
4. SoC firmware updated (ACPI/thermal management fixed)
5. USB-C PD firmware could not be updated (download failed on host resolution; still at factory version 0x00000001)
6. 30-minute restart PERSISTS after all software fixes
CONCLUSION: Hardware defect in power management subsystem (PMIC), thermal
management hardware, or mainboard requiring physical replacement.
COMPARISON TO KNOWN ISSUES
This matches documented cases in NVIDIA Developer Forums:
- Case 1: DGX spark keeps rebooting every 20-30 minutes → Required RMA
- Case 2: Is This a Hardware Issue Requiring Repeated Shutdowns → NVIDIA
confirmed hardware defect, RMA authorized
- Multiple forum users with identical 30-minute restart pattern required
hardware replacement
REQUEST
1. IMMEDIATE RMA AUTHORIZATION for my DGX Spark
2. PRIORITY HANDLING, as the system is completely unusable for any workload
Sounds like you’re watchdogging.
I’d rule that out first before you talk to the manager.
Sounds like you’re hitting the hardware watchdog. Make sure you didn’t disable the software ‘kick’ by accident.
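One way to follow up on the watchdog suggestion is to look at what the kernel exposes under `/sys/class/watchdog`. The sketch below is a rough starting point in Python, assuming the kernel was built with watchdog sysfs support and exposes the usual `identity`, `state`, and `timeout` attributes; whether the DGX Spark does so is not confirmed here, and the function names are illustrative.

```python
from pathlib import Path

# Standard sysfs location; it may be absent on some kernels.
WATCHDOG_ROOT = Path("/sys/class/watchdog")


def read_attr(device, name):
    """Read one sysfs attribute, or return None if it is unavailable."""
    try:
        return (device / name).read_text().strip()
    except OSError:
        return None


def list_watchdogs(root=WATCHDOG_ROOT):
    """Return identity, state, and timeout (seconds) per watchdog device."""
    if not root.is_dir():
        return []
    found = []
    for device in sorted(root.iterdir()):
        timeout = read_attr(device, "timeout")
        found.append({
            "device": device.name,
            "identity": read_attr(device, "identity"),
            "state": read_attr(device, "state"),
            "timeout_s": int(timeout) if timeout and timeout.isdigit() else None,
        })
    return found


if __name__ == "__main__":
    for entry in list_watchdogs():
        print(entry)
```

If a device reports `state` as active, the next question is whether anything is still kicking it. On systemd-based systems, `systemctl show -p RuntimeWatchdogUSec` reports the manager's runtime watchdog setting, and `wdctl` from util-linux prints similar device details. An empty result from the script only means nothing is visible through sysfs; it does not rule out a watchdog inside the embedded controller.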