I purchased a DGX Spark 4 months ago and have not used it for the last month and a half. Now I see that it reboots every 20 minutes. I tried updating all the firmware, but that didn't help.
Can I get some assistance, please? Customer support directed me to open a new topic on this forum.
Hi, I will need more information to help you out. After booting the Spark, can you log in? If so, please generate an nvidia-bug-report by running nvidia-bug-report.sh on the command line. Also save the output of journalctl -k -b -1 -e to a log file so I can see your previous boot log and determine the shutdown cause.
To troubleshoot the insufficient-power issue, I took the following steps. They did not resolve the reboot problem. Any ideas on what else needs to be looked at? Thank you.
Boot-time blacklist test:
Created this file: /etc/modprobe.d/blacklist-mlx5-test.conf
After reboot, confirmed mlx5 modules were not loaded: lsmod | grep mlx5
The command returned no output.
Confirmed current boot had no new mlx5 PCIe insufficient-power messages: journalctl -b -k --no-pager -o short-iso -g 'Detected insufficient power on the PCIe slot'
That command returned: -- No entries --
Confirmed network was still using Wi-Fi: ip route get 1.1.1.1 showed dev wlP9s9, src 192.168.68.90
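The post does not show what was put in the blacklist file. A minimal sketch of one way to write such a file follows. The module names (mlx5_core, mlx5_ib) and the helper name write_mlx5_blacklist are assumptions for illustration, not taken from the original post.

```shell
# Sketch: write a modprobe blacklist for the mlx5 drivers to the path in $1.
# ASSUMPTION: the module names below are typical mlx5 modules; the original
# post does not list which modules were blacklisted.
write_mlx5_blacklist() {
    cat > "$1" <<'EOF'
# Keep the Mellanox/ConnectX drivers from loading automatically.
blacklist mlx5_core
blacklist mlx5_ib
# "blacklist" only stops alias-based autoloading; the install lines
# also block loading as a dependency of another module.
install mlx5_core /bin/false
install mlx5_ib /bin/false
EOF
}
# Usage (as root), followed by an initramfs rebuild and a reboot:
#   write_mlx5_blacklist /etc/modprobe.d/blacklist-mlx5-test.conf
#   update-initramfs -u
```

On Ubuntu-based systems the initramfs rebuild matters because a driver packed into the initramfs can load before /etc/modprobe.d is consulted from the root filesystem. Deleting the file and rebuilding again undoes the test.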
Result:
Even with mlx5 blacklisted from boot, no mlx5 modules loaded, and no current-boot mlx5 insufficient-power messages, the system still reset after about 19-20 minutes.
Boot history showed:
Previous boot: Sat 2026-05-02 16:12:56 EDT to Sat 2026-05-02 16:32:34 EDT
Duration: about 19 minutes 38 seconds
This matched the same hard reset interval seen before blacklisting mlx5.
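The interval can be checked for every retained boot rather than only the last one. A minimal sketch, assuming the journalctl --list-boots column layout shown in this thread and a date command that accepts -d (GNU coreutils or BusyBox); boot_durations is a hypothetical helper name:

```shell
# Print "<index> <minutes>m<seconds>s" for each boot in the listing on stdin.
# Expected columns: IDX BOOT_ID DAY DATE TIME TZ DAY DATE TIME TZ
# The function body runs in a subshell, so its variables do not leak.
boot_durations() (
    while read -r idx id d1 date1 time1 tz1 d2 date2 time2 tz2; do
        # Skip header or malformed lines: the index must be an integer.
        case $idx in
            ''|*[!0-9-]*) continue ;;
        esac
        # Both timestamps are parsed as UTC. Only the difference matters,
        # which is valid as long as both columns share the same time zone.
        start=$(date -u -d "$date1 $time1" +%s) || continue
        end=$(date -u -d "$date2 $time2" +%s) || continue
        secs=$((end - start))
        printf '%s %dm%02ds\n' "$idx" $((secs / 60)) $((secs % 60))
    done
)
# Usage on the affected machine:
#   journalctl --list-boots --no-pager | boot_durations
```

For the previous boot quoted above (16:12:56 to 16:32:34) this prints "-1 19m38s". The row for boot 0 only reflects the time elapsed so far, since that boot is still running.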
Conclusion:
Disabling mlx5/Mellanox at boot did not stop the reset loop.
To summarize: I blacklisted mlx5 from boot and confirmed no mlx5 modules loaded, but the system still restarts at the same ~19-20 minute interval.
The current suspicious clues are BERT hardware error records from the previous boot and NVIDIA GPU PCIe/DOE/link-width anomalies on 000f:01:00.0.
journalctl --list-boots --no-pager | tail -n 8
lsmod | grep mlx5 || echo "no mlx5 modules loaded"
journalctl -b -k --no-pager -o short-iso -g 'Detected insufficient power on the PCIe slot' || true
journalctl -b -k --no-pager -o short-iso | grep -A12 'BERT: Error records from previous boot'
journalctl -b -k --no-pager -o short-iso | grep -A8 -B4 'pci 000f:01:00.0: DOE'
nvidia-smi -q | grep -E 'Product Name|Driver Version|VBIOS Version|Bus Id|Device Max|Host Max|Current PCIe Generation|Current[[:space:]]*:|Max Link Width|Current Link Width|GPU Recovery Action|HW Thermal Slowdown|HW Power Braking'
lspci -nn -s 000f:01:00.0
-7 f1f853cf605e4d079867980f1471795d Tue 2026-04-21 23:35:13 EDT Tue 2026-04-21 23:37:01 EDT
-6 0431acad65c848fea20840db339ac070 Tue 2026-04-21 23:37:30 EDT Tue 2026-04-21 23:57:10 EDT
-5 2f520109301f4ee991302588b388fd94 Sat 2026-05-02 14:58:20 EDT Sat 2026-05-02 15:17:55 EDT
-4 b4ed320cedc74f18a7fc37e7706e723b Sat 2026-05-02 15:19:07 EDT Sat 2026-05-02 15:38:26 EDT
-3 872d2b2927dd48488fc33e7ee8478802 Sat 2026-05-02 15:39:45 EDT Sat 2026-05-02 15:59:19 EDT
-2 dbee0bd7a8fe42efa96cb52970707cd6 Sat 2026-05-02 16:00:25 EDT Sat 2026-05-02 16:11:48 EDT
-1 8b8b62434be146c6af0c17c69378236a Sat 2026-05-02 16:12:56 EDT Sat 2026-05-02 16:32:34 EDT
0 2b7e61da2b424045b0772fcf4ac099c7 Sat 2026-05-02 16:33:38 EDT Sat 2026-05-02 16:47:15 EDT
no mlx5 modules loaded
-- No entries --
2026-05-02T16:33:38-04:00 saispark kernel: BERT: Error records from previous boot:
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: It has been corrected by h/w and requires no further action
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: event severity: corrected
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: Error 0, type: corrected
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: section type: unknown, 3c1e3f4b-1e1a-43df-af28-59820e958e3c
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: section length: 0x3e
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: 00000000: 000d0000 544d0000 0044494b 00000000 …MTKID…
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: 00000010: 00000000 00000010 00000022 56190000 …"…V
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: 00000020: a451e0a4 c2964450 0ae9a1c7 0000fa95 ..Q.PD…
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: 00000030: 00 a0 00 00 00 00 00 00 00 00 00 00 00 80 …
2026-05-02T16:33:38-04:00 saispark kernel: BERT: Total records found: 1
2026-05-02T16:33:38-04:00 saispark kernel: pcieport 0000:00:00.0: Adding to iommu group 0
2026-05-02T16:33:38-04:00 saispark kernel: pcieport 0000:00:00.0: PME: Signaling with IRQ 329
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PME# supported from D0 D3hot
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: Enabling HDA controller
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PCI bridge to [bus 01]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PCI bridge to [bus 01]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: bridge window [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci_bus 000f:00: resource 4 [mem 0x24000000-0x281fffff window]
2026-05-02T16:33:38-04:00 saispark kernel: pci_bus 000f:01: resource 2 [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: Max Payload Size set to 256/ 512 (was 128), Max Read Rq 512
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: Max Payload Size set to 256/ 256 (was 128), Max Read Rq 512
==============NVSMI LOG==============
Timestamp : Sat May 2 16:47:17 2026
Driver Version : 580.95.05
CUDA Version : 13.0
Attached GPUs : 1
GPU 0000000F:01:00.0
Product Name : NVIDIA GB10
Product Brand : NVIDIA RTX
Product Architecture : Blackwell
Display Mode : Requested functionality has been deprecated
Display Attached : Yes
Display Active : Enabled
Persistence Mode : Enabled
Addressing Mode : ATS
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-c1ec4b1b-cae2-455f-0d1e-2823db17abbc
GPU PDI : 0xf7178f2ed4cdaacc
Minor Number : 0
VBIOS Version : 9A.0B.1E.00.00
MultiGPU Board : No
Board ID : 0xf0100
Board Part Number : N/A
GPU Part Number : 2E12-275-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number :
Slot Number : 0
Tray Index : 0
Host ID : 1
Peer Type : Direct Connected
Module Id : 1
GPU Fabric GUID : 0x0000000000000000
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : Enabled
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Recovery Action : None
GSP Firmware Version : 580.95.05
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x000F
Base Classcode : 0x3
Sub Classcode : 0x0
Device Id : 0x2E1210DE
Bus Id : 0000000F:01:00.0
Sub System Id : 0x000010DE
GPU Link Info
PCIe Generation
Max : 1
Current : 1
Device Current : 1
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 1x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : N/A
Rx Throughput : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Atomic Caps Inbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Clocks Event Reasons Counters
SW Power Capping : 101734 us
Sync Boost : 0 us
SW Thermal Slowdown : 0 us
HW Thermal Slowdown : 0 us
HW Power Braking : 0 us
Sparse Operation Mode : N/A
FB Memory Usage
Total : N/A
Reserved : N/A
Used : N/A
Free : N/A
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 1 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Channel Repair Pending : N/A
TPC Repair Pending : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 40 C
GPU T.Limit Temp : 55 C
GPU Shutdown T.Limit Temp : N/A
GPU Slowdown T.Limit Temp : N/A
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 5.21 W
Instantaneous Power Draw : 5.85 W
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 221 MHz
SM : 221 MHz
Memory : N/A
Video : 598 MHz
Applications Clocks
Graphics : 2418 MHz
Memory : N/A
Default Applications Clocks
Graphics : 2418 MHz
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 3003 MHz
SM : 3003 MHz
Memory : N/A
Video : 3003 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Summary : N/A
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Incorrect Configuration : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3882
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 297 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 4278
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 176 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 10350
Type : G
Name : /snap/firefox/8242/usr/lib/firefox/firefox
Used GPU Memory : 226 MiB
Capabilities
EGM : disabled
000f:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2e12] (rev a1)
Analyzed the bug report and inline data. The reboot loop has a consistent low-level signature.
A BERT hardware error record is present in the bug report: section type GUID 3c1e3f4b-1e1a-43df-af28-59820e958e3c, vendor MTK, single corrected hardware error.
Every boot hits the same DOE mailbox sequence on 000f:01:00.0 at ~1.13s:
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link
The PCIe link is degraded: 2.5GT/s x1 (downgraded) is confirmed in lspci from the bug report. BusMaster+ is still present, so the GPU partially enumerates and the OS can boot, but the link is not in a normal usable state.
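The negotiated link can also be read straight from sysfs without parsing lspci output. A rough sketch, assuming the standard PCI sysfs attributes (current_link_speed, max_link_speed, current_link_width, max_link_width); pcie_link_status is a hypothetical helper name:

```shell
# Compare current vs. maximum PCIe link speed and width for one device.
# $1 is the device's sysfs directory.
# The function body runs in a subshell, so its variables do not leak.
pcie_link_status() (
    dev=$1
    cur_speed=$(cat "$dev/current_link_speed")
    max_speed=$(cat "$dev/max_link_speed")
    cur_width=$(cat "$dev/current_link_width")
    max_width=$(cat "$dev/max_link_width")
    if [ "$cur_speed" = "$max_speed" ] && [ "$cur_width" = "$max_width" ]; then
        state=ok
    else
        state=degraded
    fi
    echo "$state: $cur_speed x$cur_width (max $max_speed x$max_width)"
)
# Usage on the affected machine:
#   pcie_link_status /sys/bus/pci/devices/000f:01:00.0
```

On many systems the link speed drops at idle as a power-saving measure, so a width mismatch is generally the stronger signal of the two.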
The SBSA generic watchdog is armed:
ACPI GTDT: found 1 SBSA generic Watchdog(s)
The repeatable ~20-minute reboot cadence is consistent with a watchdog or firmware timeout path acting on the degraded platform state.
The mlx5 blacklist test was useful: the reboot interval did not change, which helps rule out the Mellanox path as the trigger. There are no Xid errors and no GSP fault chain in the collected data, so this does not currently look like a GPU runtime failure. The failure signature is at the PCIe / firmware enumeration layer.
The boot journal shows 8 consecutive boots hitting the same DOE abort sequence, so this is persistent rather than intermittent.
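To re-check that count after any firmware or hardware change, the DOE abort line can be counted per boot. A small sketch, assuming the message text matches the excerpt above; count_doe_aborts is a hypothetical helper name:

```shell
# Count DOE mailbox abort timeouts in a kernel log read from stdin.
# The pattern accepts any hex capability offset, not only [2c8].
count_doe_aborts() {
    grep -c 'DOE: \[[0-9a-f]*] ABORT timed out'
}
# Usage on the affected machine, one line per retained boot:
#   for b in -7 -6 -5 -4 -3 -2 -1 0; do
#       printf '%s: ' "$b"
#       journalctl -k -b "$b" --no-pager | count_doe_aborts
#   done
```

A boot that prints 0 after a change would suggest the enumeration problem is no longer present on that boot, although that alone would not prove the reboot loop is fixed.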