DGX Spark reboots every ~20 minutes

Hi,

I purchased a DGX Spark 4 months ago. I have not used it for the last month and a half. Now I see that it reboots about every 20 minutes. I tried updating all the firmware, but that didn't help.

Can I get some assistance, please? Customer support directed me to open a new topic on this forum.

Thank you.

Where is the ROTFLOL emoji when I need one.

Hi, I will need more information to help you out. After booting the Spark, can you log in? If so, please generate an nvidia-bug-report by running nvidia-bug-report.sh on the command line. Also save the output of journalctl -k -b -1 -e to a log file so I can see your previous boot log and determine the shutdown cause.

nvidia-bug-report.log.gz (500.6 KB)

root@saispark:/home/suman# journalctl -k -b -1 -e
May 02 15:11:26 saispark kernel: audit: type=1400 audit(1777749086.555:353): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:31 saispark kernel: audit: type=1400 audit(1777749091.555:354): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:36 saispark kernel: audit: type=1400 audit(1777749096.556:355): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:41 saispark kernel: audit: type=1400 audit(1777749101.557:356): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:46 saispark kernel: audit: type=1400 audit(1777749106.565:357): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:51 saispark kernel: audit: type=1400 audit(1777749111.565:358): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:11:56 saispark kernel: audit: type=1400 audit(1777749116.632:359): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:01 saispark kernel: audit: type=1400 audit(1777749121.632:360): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:06 saispark kernel: audit: type=1400 audit(1777749126.648:361): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:11 saispark kernel: audit: type=1400 audit(1777749131.699:362): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:16 saispark kernel: audit: type=1400 audit(1777749136.700:363): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:21 saispark kernel: audit: type=1400 audit(1777749141.700:364): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:26 saispark kernel: audit: type=1400 audit(1777749146.701:365): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:31 saispark kernel: audit: type=1400 audit(1777749151.702:366): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:36 saispark kernel: audit: type=1400 audit(1777749156.705:367): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:41 saispark kernel: audit: type=1400 audit(1777749161.706:368): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:46 saispark kernel: audit: type=1400 audit(1777749166.706:369): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:51 saispark kernel: audit: type=1400 audit(1777749171.774:370): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:12:56 saispark kernel: audit: type=1400 audit(1777749176.775:371): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:01 saispark kernel: audit: type=1400 audit(1777749181.776:372): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:06 saispark kernel: audit: type=1400 audit(1777749186.780:373): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:12 saispark kernel: audit: type=1400 audit(1777749192.401:374): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:17 saispark kernel: audit: type=1400 audit(1777749197.792:375): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:22 saispark kernel: audit: type=1400 audit(1777749202.822:376): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:27 saispark kernel: audit: type=1400 audit(1777749207.822:377): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:33 saispark kernel: audit: type=1400 audit(1777749213.422:378): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:13:43 saispark kernel: audit: type=1400 audit(1777749223.148:379): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:14:00 saispark kernel: audit: type=1400 audit(1777749240.316:380): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:16 saispark kernel: audit: type=1400 audit(1777749316.873:381): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:16 saispark kernel: audit: type=1400 audit(1777749316.928:382): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:16 saispark kernel: audit: type=1400 audit(1777749316.939:383): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:16 saispark kernel: audit: type=1400 audit(1777749316.995:384): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:22 saispark kernel: audit: type=1400 audit(1777749322.602:385): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:27 saispark kernel: audit: type=1400 audit(1777749327.632:386): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:32 saispark kernel: audit: type=1400 audit(1777749332.633:387): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:33 saispark kernel: audit: type=1400 audit(1777749333.049:388): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:33 saispark kernel: audit: type=1400 audit(1777749333.103:389): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:38 saispark kernel: audit: type=1400 audit(1777749338.123:390): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:57 saispark kernel: audit: type=1400 audit(1777749357.635:391): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:15:57 saispark kernel: audit: type=1400 audit(1777749357.983:392): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:15:58 saispark kernel: audit: type=1400 audit(1777749358.038:393): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:02 saispark kernel: audit: type=1400 audit(1777749362.800:394): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:07 saispark kernel: audit: type=1400 audit(1777749367.804:395): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:12 saispark kernel: audit: type=1400 audit(1777749372.807:396): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:17 saispark kernel: audit: type=1400 audit(1777749377.817:397): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:22 saispark kernel: audit: type=1400 audit(1777749382.858:398): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:24 saispark kernel: audit: type=1400 audit(1777749384.593:399): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:24 saispark kernel: audit: type=1400 audit(1777749384.647:400): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:27 saispark kernel: audit: type=1400 audit(1777749387.891:401): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:32 saispark kernel: audit: type=1400 audit(1777749392.625:402): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:32 saispark kernel: audit: type=1400 audit(1777749392.678:403): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:32 saispark kernel: audit: type=1400 audit(1777749392.891:404): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:37 saispark kernel: audit: type=1400 audit(1777749397.894:405): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:42 saispark kernel: audit: type=1400 audit(1777749402.654:406): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:42 saispark kernel: audit: type=1400 audit(1777749402.708:407): apparmor=“DENIED” operation=“unlink” class=“file” profile=“snap.firefox.firefox” name=“/dev/char/195:254” pid=7592 comm=“CanvasRenderer” requested_mask=“d” denied_mask=“d” fsui>
May 02 15:16:42 saispark kernel: audit: type=1400 audit(1777749402.894:408): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:16:47 saispark kernel: audit: type=1400 audit(1777749407.975:409): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:23 saispark kernel: audit: type=1400 audit(1777749443.509:410): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:28 saispark kernel: audit: type=1400 audit(1777749448.520:411): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:33 saispark kernel: audit: type=1400 audit(1777749453.528:412): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:38 saispark kernel: audit: type=1400 audit(1777749458.537:413): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:43 saispark kernel: audit: type=1400 audit(1777749463.545:414): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:48 saispark kernel: audit: type=1400 audit(1777749468.568:415): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
May 02 15:17:55 saispark kernel: audit: type=1400 audit(1777749475.714:416): apparmor=“DENIED” operation=“open” class=“file” profile=“snap.firefox.firefox” name=“/proc/pressure/memory” pid=7592 comm=“MemoryPoller” requested_mask=“r” denied_mask=“r” fsui>
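The kernel log above consists almost entirely of AppArmor audit records from the Firefox snap, which makes it hard to spot anything related to the shutdown. As a rough sketch (not an NVIDIA tool), the following Python filter drops kernel audit records from saved journalctl output so the remaining lines are easier to read. The function name `drop_audit_noise` and the sample list are illustrative assumptions; the sample lines are shortened copies of lines posted in this thread.

```python
import re

# Matches kernel audit records such as the AppArmor denials in the log above.
AUDIT_RE = re.compile(r"kernel: audit: type=\d+ ")


def drop_audit_noise(lines):
    """Return only the kernel log lines that are not audit records."""
    return [line for line in lines if not AUDIT_RE.search(line)]


# Sample input: two audit records (cut short for brevity) and one
# non-audit kernel line taken from the second log in this thread.
sample = [
    "May 02 15:17:48 saispark kernel: audit: type=1400 audit(1777749468.568:415):",
    "May 02 15:17:55 saispark kernel: audit: type=1400 audit(1777749475.714:416):",
    "2026-05-02T16:00:25-04:00 saispark kernel: BERT: Total records found: 1",
]

for line in drop_audit_noise(sample):
    print(line)
```

To use it on a real log, read the file that holds the saved `journalctl -k -b -1` output and pass its lines to `drop_audit_noise`.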

Please see the attached output of nvidia-bug-report.sh; the journalctl output is included inline.

I appreciate the help.

I am also seeing a few errors related to insufficient power. Please see the following commands and output.

I was wondering why the "insufficient power on the PCIe slot" warning is showing up. I appreciate the input. Thank you.

$ ./spark-collect-pcie-mlx5-support-log.sh 
===== date / uptime / kernel =====
2026-05-02T16:23:40-04:00
up 10 minutes
Linux saispark 6.14.0-1015-nvidia #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 25 18:02:16 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

===== boot history =====
-19 1ecf8852e6384428838e697209433e34 Tue 2026-04-21 20:41:44 EDT Tue 2026-04-21 20:43:57 EDT
-18 357e42a9ef944af9a19ef921f0974743 Tue 2026-04-21 20:44:41 EDT Tue 2026-04-21 20:52:12 EDT
-17 24f4cce48d444cfca26662635c1f4ecc Tue 2026-04-21 20:52:43 EDT Tue 2026-04-21 21:11:55 EDT
-16 152b41f9e3cb4e2280a760d722b3f28d Tue 2026-04-21 21:13:47 EDT Tue 2026-04-21 21:14:42 EDT
-15 a45f9d9550f34f8aa308557ff9abf7e2 Tue 2026-04-21 21:15:47 EDT Tue 2026-04-21 21:16:32 EDT
-14 d95a0ad82a3a44608675d2581119c363 Tue 2026-04-21 21:24:36 EDT Tue 2026-04-21 21:25:29 EDT
-13 40f0458171c94b439e10d81e83c94823 Tue 2026-04-21 21:27:21 EDT Tue 2026-04-21 21:45:01 EDT
-12 b93e56be0de44164a05beb08bf4b2eb2 Tue 2026-04-21 21:48:05 EDT Tue 2026-04-21 22:05:01 EDT
-11 2dd0acf6788e42dcb32b9d082fe78e56 Tue 2026-04-21 22:08:44 EDT Tue 2026-04-21 22:25:01 EDT
-10 cd33a67a266641b69eedb7d1c8d29051 Tue 2026-04-21 22:29:27 EDT Tue 2026-04-21 22:48:57 EDT
 -9 0125898f58504fb79a90d50d7f1942e0 Tue 2026-04-21 22:50:05 EDT Tue 2026-04-21 22:52:13 EDT
 -8 06600f32a8954882aa3ac4370278ac3a Tue 2026-04-21 22:53:46 EDT Tue 2026-04-21 23:10:08 EDT
 -7 05fd53ed92f14f9199ee8a05dcf47009 Tue 2026-04-21 23:14:30 EDT Tue 2026-04-21 23:32:32 EDT
 -6 f1f853cf605e4d079867980f1471795d Tue 2026-04-21 23:35:13 EDT Tue 2026-04-21 23:37:01 EDT
 -5 0431acad65c848fea20840db339ac070 Tue 2026-04-21 23:37:30 EDT Tue 2026-04-21 23:57:10 EDT
 -4 2f520109301f4ee991302588b388fd94 Sat 2026-05-02 14:58:20 EDT Sat 2026-05-02 15:17:55 EDT
 -3 b4ed320cedc74f18a7fc37e7706e723b Sat 2026-05-02 15:19:07 EDT Sat 2026-05-02 15:38:26 EDT
 -2 872d2b2927dd48488fc33e7ee8478802 Sat 2026-05-02 15:39:45 EDT Sat 2026-05-02 15:59:19 EDT
 -1 dbee0bd7a8fe42efa96cb52970707cd6 Sat 2026-05-02 16:00:25 EDT Sat 2026-05-02 16:11:48 EDT
  0 8b8b62434be146c6af0c17c69378236a Sat 2026-05-02 16:12:56 EDT Sat 2026-05-02 16:23:35 EDT
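To check the "every ~20 minutes" pattern against the boot history above, here is a small Python sketch that computes how long each boot lasted. It assumes the rows follow the layout shown (index, boot ID, then first and last journal timestamps), which looks like `journalctl --list-boots` output. The function name `boot_durations` is an illustrative assumption, and the sample rows are copied from the table above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def boot_durations(lines):
    """Parse boot-history rows into (boot index, minutes between first and last entry)."""
    result = []
    for line in lines:
        parts = line.split()
        # Layout: index, boot id, then "Day YYYY-MM-DD HH:MM:SS TZ" twice.
        # The time zone is ignored; this assumes both timestamps share it.
        index = int(parts[0])
        first = datetime.strptime(f"{parts[3]} {parts[4]}", FMT)
        last = datetime.strptime(f"{parts[7]} {parts[8]}", FMT)
        minutes = round((last - first).total_seconds() / 60, 1)
        result.append((index, minutes))
    return result


sample = [
    " -4 2f520109301f4ee991302588b388fd94 Sat 2026-05-02 14:58:20 EDT Sat 2026-05-02 15:17:55 EDT",
    " -3 b4ed320cedc74f18a7fc37e7706e723b Sat 2026-05-02 15:19:07 EDT Sat 2026-05-02 15:38:26 EDT",
    " -2 872d2b2927dd48488fc33e7ee8478802 Sat 2026-05-02 15:39:45 EDT Sat 2026-05-02 15:59:19 EDT",
    " -1 dbee0bd7a8fe42efa96cb52970707cd6 Sat 2026-05-02 16:00:25 EDT Sat 2026-05-02 16:11:48 EDT",
]

for index, minutes in boot_durations(sample):
    print(index, minutes)
# -4 19.6
# -3 19.3
# -2 19.6
# -1 11.4
```

The second timestamp is the last journal entry of that boot, so it only approximates the moment of the reset. For these rows, three of the four boots end after roughly 19 to 20 minutes.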

===== current mlx5 state =====
No mlx5 modules loaded

===== mlx5 blacklist =====
  blacklist mlx5_core
  blacklist mlx5_ib
  blacklist mlx5_fwctl
  

===== previous boot insufficient power warnings =====
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: mlx5_pcie_event:326:(pid 395): Detected insufficient power on the PCIe slot (27W).
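To see at a glance which devices report the warning, the lines above can be summarized per PCI address. This is a minimal Python sketch that assumes the message text is exactly as shown in this log; the regular expression and the function name `power_warnings` are illustrative assumptions, not part of the mlx5 driver or any NVIDIA tool.

```python
import re

# Captures the PCI address and the wattage from the mlx5_core warning text.
POWER_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*"
    r"Detected insufficient power on the PCIe slot \((?P<watts>\d+)W\)"
)


def power_warnings(lines):
    """Map each PCI address to the wattage reported in its warning."""
    found = {}
    for line in lines:
        match = POWER_RE.search(line)
        if match:
            found[match.group("bdf")] = int(match.group("watts"))
    return found


# Sample lines copied from the previous-boot warnings above.
sample = [
    "2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).",
    "2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).",
    "2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).",
    "2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: mlx5_pcie_event:326:(pid 395): Detected insufficient power on the PCIe slot (27W).",
]

for bdf, watts in power_warnings(sample).items():
    print(bdf, watts)
```

For the sample, all four mlx5 functions (two per PCI domain) report the same 27W figure, all at boot time.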

===== previous boot related PCIe / mlx5 / hardware context =====
2026-05-02T16:00:25-04:00 saispark kernel: ACPI: BERT 0x0000000087065D98 000030 (v01 MTKID  MTKTABLE 00000001 CREA 00000001)
2026-05-02T16:00:25-04:00 saispark kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.14.0-1015-nvidia root=UUID=d27bfd26-ff30-400e-9eca-9cdf73de9406 ro init_on_alloc=0 console=tty0 plymouth.ignore-serial-consoles plymouth.use-simpledrm earlycon=uart,mmio32,0x16A00000 console=tty0 console=ttyS0,921600 crashkernel=1G-:0M quiet splash pci=pcie_bus_safe vt.handoff=7
2026-05-02T16:00:25-04:00 saispark kernel: ACPI: USB4 _OSC: OS supports USB3+ DisplayPort+ PCIe+ XDomain+
2026-05-02T16:00:25-04:00 saispark kernel: ACPI: USB4 _OSC: OS controls USB3+ DisplayPort+ PCIe+ XDomain+
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 0000:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 0000:01:00.0: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: pci 0000:01:00.1: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 0002:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 0002:01:00.0: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: pci 0002:01:00.1: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 0004:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 0004:01:00.0: [144d:a810] type 00 class 0x010802 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 0007:00:00.0: [10de:22d0] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 0007:01:00.0: [10ec:8127] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 0009:00:00.0: [10de:22d0] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 0009:01:00.0: [14c3:7925] type 00 class 0x028000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: acpi PNP0A08:0b: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:00:25-04:00 saispark kernel: pci 000f:00:00.0: [10de:22d1] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:00:25-04:00 saispark kernel: pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
2026-05-02T16:00:25-04:00 saispark kernel: pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
2026-05-02T16:00:25-04:00 saispark kernel: BERT: Error records from previous boot:
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]: It has been corrected by h/w and requires no further action
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]: event severity: corrected
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:  Error 0, type: corrected
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   section type: unknown, 3c1e3f4b-1e1a-43df-af28-59820e958e3c
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   section length: 0x3e
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   00000000: 000d0000 544d0000 0044494b 00000000  ......MTKID.....
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   00000010: 00000000 00000010 00000022 56190000  ........"......V
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   00000020: a451e0a4 c2964450 0ae9a1c7 0000fa95  ..Q.PD..........
2026-05-02T16:00:25-04:00 saispark kernel: [Hardware Error]:   00000030: 00 a0 00 00 00 00 00 00 00 00 00 00 00 80        ..............
2026-05-02T16:00:25-04:00 saispark kernel: BERT: Total records found: 1
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0000:00:00.0: Adding to iommu group 0
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0000:00:00.0: PME: Signaling with IRQ 329
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0000:00:00.0: AER: enabled with IRQ 330
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0002:00:00.0: Adding to iommu group 1
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0002:00:00.0: PME: Signaling with IRQ 332
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0002:00:00.0: AER: enabled with IRQ 333
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0004:00:00.0: Adding to iommu group 2
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0004:00:00.0: PME: Signaling with IRQ 335
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0004:00:00.0: AER: enabled with IRQ 336
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0004:00:00.0: pciehp: Slot #4 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0007:00:00.0: Adding to iommu group 3
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0007:00:00.0: PME: Signaling with IRQ 338
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0007:00:00.0: AER: enabled with IRQ 339
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0007:00:00.0: pciehp: Slot #7 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0009:00:00.0: Adding to iommu group 4
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0009:00:00.0: PME: Signaling with IRQ 341
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0009:00:00.0: AER: enabled with IRQ 342
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 0009:00:00.0: pciehp: Slot #9 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 000f:00:00.0: Adding to iommu group 5
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 000f:00:00.0: PME: Signaling with IRQ 343
2026-05-02T16:00:25-04:00 saispark kernel: pcieport 000f:00:00.0: AER: enabled with IRQ 345
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: Adding to iommu group 10
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: firmware version: 28.45.4028
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: Flow counters bulk query buffer size increased, bulk_query_len(8)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: mlx5e: IPSec ESP acceleration enabled
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: Adding to iommu group 15
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: enabling device (0000 -> 0002)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: firmware version: 28.45.4028
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: Flow counters bulk query buffer size increased, bulk_query_len(8)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: mlx5e: IPSec ESP acceleration enabled
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: Port module event: module 1, Cable unplugged
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: Adding to iommu group 16
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: enabling device (0000 -> 0002)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: firmware version: 28.45.4028
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: Flow counters bulk query buffer size increased, bulk_query_len(8)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: mlx5e: IPSec ESP acceleration enabled
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: Port module event: module 0, Cable unplugged
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: Adding to iommu group 17
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: enabling device (0000 -> 0002)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: firmware version: 28.45.4028
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: Flow counters bulk query buffer size increased, bulk_query_len(8)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: mlx5e: IPSec ESP acceleration enabled
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: Port module event: module 1, Cable unplugged
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: mlx5_pcie_event:326:(pid 395): Detected insufficient power on the PCIe slot (27W).
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.0 enp1s0f0np0: renamed from eth0
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.1 enP2p1s0f1np1: renamed from eth3
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0000:01:00.1 enp1s0f1np1: renamed from eth1
2026-05-02T16:00:25-04:00 saispark kernel: mlx5_core 0002:01:00.0 enP2p1s0f0np0: renamed from eth2
2026-05-02T16:00:25-04:00 saispark kernel:   MST::  : mst_init 1715: Mellanox Technologies Software Tools Driver - version 2.0.0
2026-05-02T16:00:30-04:00 saispark kernel: mlx5_core 0002:01:00.0 enP2p1s0f0np0: Link down
2026-05-02T16:00:30-04:00 saispark kernel: mlx5_core 0002:01:00.1 enP2p1s0f1np1: Link down
2026-05-02T16:00:32-04:00 saispark kernel: mlx5_core 0000:01:00.0 enp1s0f0np0: Link down
2026-05-02T16:00:32-04:00 saispark kernel: mlx5_core 0000:01:00.1 enp1s0f1np1: Link down

===== current boot related PCIe / mlx5 / hardware context =====
2026-05-02T16:12:56-04:00 saispark kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.14.0-1015-nvidia root=UUID=d27bfd26-ff30-400e-9eca-9cdf73de9406 ro init_on_alloc=0 console=tty0 plymouth.ignore-serial-consoles plymouth.use-simpledrm earlycon=uart,mmio32,0x16A00000 console=tty0 console=ttyS0,921600 crashkernel=1G-:0M quiet splash pci=pcie_bus_safe vt.handoff=7
2026-05-02T16:12:56-04:00 saispark kernel: ACPI: USB4 _OSC: OS supports USB3+ DisplayPort+ PCIe+ XDomain+
2026-05-02T16:12:56-04:00 saispark kernel: ACPI: USB4 _OSC: OS controls USB3+ DisplayPort+ PCIe+ XDomain+
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 0000:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 0000:01:00.0: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: pci 0000:01:00.1: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 0002:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 0002:01:00.0: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: pci 0002:01:00.1: [15b3:1021] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 0004:00:00.0: [10de:22ce] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 0004:01:00.0: [144d:a810] type 00 class 0x010802 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 0007:00:00.0: [10de:22d0] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 0007:01:00.0: [10ec:8127] type 00 class 0x020000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 0009:00:00.0: [10de:22d0] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 0009:01:00.0: [14c3:7925] type 00 class 0x028000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: acpi PNP0A08:0b: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
2026-05-02T16:12:56-04:00 saispark kernel: pci 000f:00:00.0: [10de:22d1] type 01 class 0x060400 PCIe Root Port
2026-05-02T16:12:56-04:00 saispark kernel: pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
2026-05-02T16:12:56-04:00 saispark kernel: pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0000:00:00.0: Adding to iommu group 0
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0000:00:00.0: PME: Signaling with IRQ 329
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0000:00:00.0: AER: enabled with IRQ 330
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0002:00:00.0: Adding to iommu group 1
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0002:00:00.0: PME: Signaling with IRQ 332
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0002:00:00.0: AER: enabled with IRQ 333
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0004:00:00.0: Adding to iommu group 2
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0004:00:00.0: PME: Signaling with IRQ 335
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0004:00:00.0: AER: enabled with IRQ 336
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0004:00:00.0: pciehp: Slot #4 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0007:00:00.0: Adding to iommu group 3
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0007:00:00.0: PME: Signaling with IRQ 338
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0007:00:00.0: AER: enabled with IRQ 339
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0007:00:00.0: pciehp: Slot #7 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0009:00:00.0: Adding to iommu group 4
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0009:00:00.0: PME: Signaling with IRQ 341
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0009:00:00.0: AER: enabled with IRQ 342
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 0009:00:00.0: pciehp: Slot #9 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 000f:00:00.0: Adding to iommu group 5
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 000f:00:00.0: PME: Signaling with IRQ 343
2026-05-02T16:12:56-04:00 saispark kernel: pcieport 000f:00:00.0: AER: enabled with IRQ 345
2026-05-02T16:12:56-04:00 saispark kernel:   MST::  : mst_init 1715: Mellanox Technologies Software Tools Driver - version 2.0.0
suman@saispark:~$ 

To troubleshoot the insufficient-power message, I took the following steps. They did not fix the reboot problem. Any ideas on what else needs to be looked at? Thank you.

Boot-time blacklist test:

  1. Created this file:
    /etc/modprobe.d/blacklist-mlx5-test.conf

  2. File contents:
    blacklist mlx5_core
    blacklist mlx5_ib
    blacklist mlx5_fwctl

  3. Rebuilt initramfs:
    sudo update-initramfs -u

  4. Rebooted the system.

  5. After reboot, confirmed mlx5 modules were not loaded:
    lsmod | grep mlx5

  6. The command returned no output.

  7. Confirmed current boot had no new mlx5 PCIe insufficient-power messages:
    journalctl -b -k --no-pager -o short-iso -g 'Detected insufficient power on the PCIe slot'

  8. That command returned:
    -- No entries --

  9. Confirmed network was still using Wi-Fi:
    ip route get 1.1.1.1
    dev wlP9s9, src 192.168.68.90

Result:

  • Even with mlx5 blacklisted from boot, no mlx5 modules loaded, and no current-boot mlx5 insufficient-power messages, the system still
    reset after about 19-20 minutes.
  • Boot history showed:
    • Previous boot: Sat 2026-05-02 16:12:56 EDT to Sat 2026-05-02 16:32:34 EDT
    • Duration: about 19 minutes 38 seconds
  • This matched the same hard reset interval seen before blacklisting mlx5.

Conclusion:

  • Disabling mlx5/Mellanox at boot did not stop the reset loop.
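
To check the interval across every recorded boot rather than only the previous one, the boot list can be turned into durations. This is a minimal sketch, assuming the `journalctl --list-boots` layout posted further down in this thread (index, boot ID, first entry, last entry); the `boot_durations` helper is a hypothetical name, not part of any tool.

```python
from datetime import datetime

# `journalctl --list-boots` output as posted in this thread.
BOOTS = """\
-7 f1f853cf605e4d079867980f1471795d Tue 2026-04-21 23:35:13 EDT Tue 2026-04-21 23:37:01 EDT
-6 0431acad65c848fea20840db339ac070 Tue 2026-04-21 23:37:30 EDT Tue 2026-04-21 23:57:10 EDT
-5 2f520109301f4ee991302588b388fd94 Sat 2026-05-02 14:58:20 EDT Sat 2026-05-02 15:17:55 EDT
-4 b4ed320cedc74f18a7fc37e7706e723b Sat 2026-05-02 15:19:07 EDT Sat 2026-05-02 15:38:26 EDT
-3 872d2b2927dd48488fc33e7ee8478802 Sat 2026-05-02 15:39:45 EDT Sat 2026-05-02 15:59:19 EDT
-2 dbee0bd7a8fe42efa96cb52970707cd6 Sat 2026-05-02 16:00:25 EDT Sat 2026-05-02 16:11:48 EDT
-1 8b8b62434be146c6af0c17c69378236a Sat 2026-05-02 16:12:56 EDT Sat 2026-05-02 16:32:34 EDT
0 2b7e61da2b424045b0772fcf4ac099c7 Sat 2026-05-02 16:33:38 EDT Sat 2026-05-02 16:47:15 EDT
"""


def boot_durations(text):
    """Return {boot_index: seconds between first and last journal entry}."""
    fmt = "%Y-%m-%d %H:%M:%S"
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 10:
            continue  # skip header or malformed lines
        # parts: idx, boot_id, dow, date, time, tz, dow, date, time, tz
        start = datetime.strptime(f"{parts[3]} {parts[4]}", fmt)
        end = datetime.strptime(f"{parts[7]} {parts[8]}", fmt)
        result[int(parts[0])] = int((end - start).total_seconds())
    return result


durations = boot_durations(BOOTS)
for idx, secs in sorted(durations.items()):
    print(f"boot {idx:>2}: {secs // 60:>2} min {secs % 60:02d} s")
```

On this data, boots -6, -5, -4, -3 and -1 each lasted between 19 min 19 s and 19 min 40 s. Boots -7 and -2 were shorter (1 min 48 s and 11 min 23 s), and boot 0 was still running when the list was taken, so those three say nothing about the reset interval.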

Summary: I blacklisted mlx5 at boot and confirmed that no mlx5 modules were loaded, but the system still restarts at the same ~19-20 minute interval.
The current suspicious clues are the BERT hardware error record from the previous boot and the NVIDIA GPU PCIe/DOE/link-width anomalies on
000f:01:00.0. The commands I ran are listed below, followed by their output.

journalctl --list-boots --no-pager | tail -n 8

lsmod | grep mlx5 || echo "no mlx5 modules loaded"

journalctl -b -k --no-pager -o short-iso -g 'Detected insufficient power on the PCIe slot' || true

journalctl -b -k --no-pager -o short-iso | grep -A12 'BERT: Error records from previous boot'

journalctl -b -k --no-pager -o short-iso | grep -A8 -B4 'pci 000f:01:00.0: DOE'

nvidia-smi -q | grep -E 'Product Name|Driver Version|VBIOS Version|Bus Id|Device Max|Host Max|Current PCIe Generation|Current[[:space:]]*:|Max Link Width|Current Link Width|GPU Recovery Action|HW Thermal Slowdown|HW Power Braking'

lspci -nn -s 000f:01:00.0
-7 f1f853cf605e4d079867980f1471795d Tue 2026-04-21 23:35:13 EDT Tue 2026-04-21 23:37:01 EDT
-6 0431acad65c848fea20840db339ac070 Tue 2026-04-21 23:37:30 EDT Tue 2026-04-21 23:57:10 EDT
-5 2f520109301f4ee991302588b388fd94 Sat 2026-05-02 14:58:20 EDT Sat 2026-05-02 15:17:55 EDT
-4 b4ed320cedc74f18a7fc37e7706e723b Sat 2026-05-02 15:19:07 EDT Sat 2026-05-02 15:38:26 EDT
-3 872d2b2927dd48488fc33e7ee8478802 Sat 2026-05-02 15:39:45 EDT Sat 2026-05-02 15:59:19 EDT
-2 dbee0bd7a8fe42efa96cb52970707cd6 Sat 2026-05-02 16:00:25 EDT Sat 2026-05-02 16:11:48 EDT
-1 8b8b62434be146c6af0c17c69378236a Sat 2026-05-02 16:12:56 EDT Sat 2026-05-02 16:32:34 EDT
0 2b7e61da2b424045b0772fcf4ac099c7 Sat 2026-05-02 16:33:38 EDT Sat 2026-05-02 16:47:15 EDT
no mlx5 modules loaded
-- No entries --
2026-05-02T16:33:38-04:00 saispark kernel: BERT: Error records from previous boot:
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: It has been corrected by h/w and requires no further action
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]: event severity: corrected
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:  Error 0, type: corrected
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   section type: unknown, 3c1e3f4b-1e1a-43df-af28-59820e958e3c
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   section length: 0x3e
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   00000000: 000d0000 544d0000 0044494b 00000000  …MTKID…
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   00000010: 00000000 00000010 00000022 56190000  …"…V
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   00000020: a451e0a4 c2964450 0ae9a1c7 0000fa95  ..Q.PD…
2026-05-02T16:33:38-04:00 saispark kernel: [Hardware Error]:   00000030: 00 a0 00 00 00 00 00 00 00 00 00 00 00 80        …
2026-05-02T16:33:38-04:00 saispark kernel: BERT: Total records found: 1
2026-05-02T16:33:38-04:00 saispark kernel: pcieport 0000:00:00.0: Adding to iommu group 0
2026-05-02T16:33:38-04:00 saispark kernel: pcieport 0000:00:00.0: PME: Signaling with IRQ 329
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PME# supported from D0 D3hot
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: Enabling HDA controller
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PCI bridge to [bus 01]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: PCI bridge to [bus 01]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0:   bridge window [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci_bus 000f:00: resource 4 [mem 0x24000000-0x281fffff window]
2026-05-02T16:33:38-04:00 saispark kernel: pci_bus 000f:01: resource 2 [mem 0x24000000-0x27ffffff 64bit pref]
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:00:00.0: Max Payload Size set to  256/ 512 (was  128), Max Read Rq  512
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512

==============NVSMI LOG==============

Timestamp                                 : Sat May  2 16:47:17 2026
Driver Version                            : 580.95.05
CUDA Version                              : 13.0

Attached GPUs                             : 1
GPU 0000000F:01:00.0
Product Name                          : NVIDIA GB10
Product Brand                         : NVIDIA RTX
Product Architecture                  : Blackwell
Display Mode                          : Requested functionality has been deprecated
Display Attached                      : Yes
Display Active                        : Enabled
Persistence Mode                      : Enabled
Addressing Mode                       : ATS
MIG Mode
Current                           : N/A
Pending                           : N/A
Accounting Mode                       : Disabled
Accounting Mode Buffer Size           : 4000
Driver Model
Current                           : N/A
Pending                           : N/A
Serial Number                         : N/A
GPU UUID                              : GPU-c1ec4b1b-cae2-455f-0d1e-2823db17abbc
GPU PDI                               : 0xf7178f2ed4cdaacc
Minor Number                          : 0
VBIOS Version                         : 9A.0B.1E.00.00
MultiGPU Board                        : No
Board ID                              : 0xf0100
Board Part Number                     : N/A
GPU Part Number                       : 2E12-275-A1
FRU Part Number                       : N/A
Platform Info
Chassis Serial Number             :
Slot Number                       : 0
Tray Index                        : 0
Host ID                           : 1
Peer Type                         : Direct Connected
Module Id                         : 1
GPU Fabric GUID                   : 0x0000000000000000
Inforom Version
Image Version                     : N/A
OEM Object                        : N/A
ECC Object                        : N/A
Power Management Object           : N/A
Inforom BBX Object Flush
Latest Timestamp                  : N/A
Latest Duration                   : N/A
GPU Operation Mode
Current                           : N/A
Pending                           : N/A
GPU C2C Mode                          : Enabled
GPU Virtualization Mode
Virtualization Mode               : None
Host VGPU Mode                    : N/A
vGPU Heterogeneous Mode           : N/A
GPU Recovery Action                   : None
GSP Firmware Version                  : 580.95.05
IBMNPU
Relaxed Ordering Mode             : N/A
PCI
Bus                               : 0x01
Device                            : 0x00
Domain                            : 0x000F
Base Classcode                    : 0x3
Sub Classcode                     : 0x0
Device Id                         : 0x2E1210DE
Bus Id                            : 0000000F:01:00.0
Sub System Id                     : 0x000010DE
GPU Link Info
PCIe Generation
Max                       : 1
Current                   : 1
Device Current            : 1
Device Max                : 5
Host Max                  : 5
Link Width
Max                       : 16x
Current                   : 1x
Bridge Chip
Type                          : N/A
Firmware                      : N/A
Replays Since Reset               : 0
Replay Number Rollovers           : 0
Tx Throughput                     : N/A
Rx Throughput                     : N/A
Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Atomic Caps Inbound               : N/A
Fan Speed                             : N/A
Performance State                     : P8
Clocks Event Reasons
Idle                              : Not Active
Applications Clocks Setting       : Not Active
SW Power Cap                      : Not Active
HW Slowdown                       : Not Active
HW Thermal Slowdown           : Not Active
HW Power Brake Slowdown       : Not Active
Sync Boost                        : Not Active
SW Thermal Slowdown               : Not Active
Display Clock Setting             : Not Active
Clocks Event Reasons Counters
SW Power Capping                  : 101734 us
Sync Boost                        : 0 us
SW Thermal Slowdown               : 0 us
HW Thermal Slowdown               : 0 us
HW Power Braking                  : 0 us
Sparse Operation Mode                 : N/A
FB Memory Usage
Total                             : N/A
Reserved                          : N/A
Used                              : N/A
Free                              : N/A
BAR1 Memory Usage
Total                             : N/A
Used                              : N/A
Free                              : N/A
Conf Compute Protected Memory Usage
Total                             : 0 MiB
Used                              : 0 MiB
Free                              : 0 MiB
Compute Mode                          : Default
Utilization
GPU                               : 1 %
Memory                            : 0 %
Encoder                           : 0 %
Decoder                           : 0 %
JPEG                              : 0 %
OFA                               : 0 %
Encoder Stats
Active Sessions                   : 0
Average FPS                       : 0
Average Latency                   : 0
FBC Stats
Active Sessions                   : 0
Average FPS                       : 0
Average Latency                   : 0
DRAM Encryption Mode
Current                           : N/A
Pending                           : N/A
ECC Mode
Current                           : N/A
Pending                           : N/A
ECC Errors
Volatile
SRAM Correctable              : N/A
SRAM Uncorrectable Parity     : N/A
SRAM Uncorrectable SEC-DED    : N/A
DRAM Correctable              : N/A
DRAM Uncorrectable            : N/A
Aggregate
SRAM Correctable              : N/A
SRAM Uncorrectable Parity     : N/A
SRAM Uncorrectable SEC-DED    : N/A
DRAM Correctable              : N/A
DRAM Uncorrectable            : N/A
SRAM Threshold Exceeded       : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2                       : N/A
SRAM SM                       : N/A
SRAM Microcontroller          : N/A
SRAM PCIE                     : N/A
SRAM Other                    : N/A
Channel Repair Pending            : N/A
TPC Repair Pending                : N/A
Retired Pages
Single Bit ECC                    : N/A
Double Bit ECC                    : N/A
Pending Page Blacklist            : N/A
Remapped Rows                         : N/A
Temperature
GPU Current Temp                  : 40 C
GPU T.Limit Temp                  : 55 C
GPU Shutdown T.Limit Temp         : N/A
GPU Slowdown T.Limit Temp         : N/A
GPU Max Operating T.Limit Temp    : 0 C
GPU Target Temperature            : N/A
Memory Current Temp               : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw                : 5.21 W
Instantaneous Power Draw          : 5.85 W
Current Power Limit               : N/A
Requested Power Limit             : N/A
Default Power Limit               : N/A
Min Power Limit                   : N/A
Max Power Limit                   : N/A
GPU Memory Power Readings
Average Power Draw                : N/A
Instantaneous Power Draw          : N/A
Module Power Readings
Average Power Draw                : N/A
Instantaneous Power Draw          : N/A
Current Power Limit               : N/A
Requested Power Limit             : N/A
Default Power Limit               : N/A
Min Power Limit                   : N/A
Max Power Limit                   : N/A
Power Smoothing                       : N/A
Workload Power Profiles
Requested Profiles                : N/A
Enforced Profiles                 : N/A
Clocks
Graphics                          : 221 MHz
SM                                : 221 MHz
Memory                            : N/A
Video                             : 598 MHz
Applications Clocks
Graphics                          : 2418 MHz
Memory                            : N/A
Default Applications Clocks
Graphics                          : 2418 MHz
Memory                            : N/A
Deferred Clocks
Memory                            : N/A
Max Clocks
Graphics                          : 3003 MHz
SM                                : 3003 MHz
Memory                            : N/A
Video                             : 3003 MHz
Max Customer Boost Clocks
Graphics                          : N/A
Clock Policy
Auto Boost                        : N/A
Auto Boost Default                : N/A
Fabric
State                             : N/A
Status                            : N/A
CliqueId                          : N/A
ClusterUUID                       : N/A
Health
Summary                       : N/A
Bandwidth                     : N/A
Route Recovery in progress    : N/A
Route Unhealthy               : N/A
Access Timeout Recovery       : N/A
Incorrect Configuration       : N/A
Processes
GPU instance ID                   : N/A
Compute instance ID               : N/A
Process ID                        : 3882
Type                          : G
Name                          : /usr/lib/xorg/Xorg
Used GPU Memory               : 297 MiB
GPU instance ID                   : N/A
Compute instance ID               : N/A
Process ID                        : 4278
Type                          : G
Name                          : /usr/bin/gnome-shell
Used GPU Memory               : 176 MiB
GPU instance ID                   : N/A
Compute instance ID               : N/A
Process ID                        : 10350
Type                          : G
Name                          : /snap/firefox/8242/usr/lib/firefox/firefox
Used GPU Memory               : 226 MiB
Capabilities
EGM                               : disabled

000f:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2e12] (rev a1)

Analyzed the bug report and inline data. The reboot loop has a consistent low-level signature.

A BERT hardware error record is present in the bug report: section type GUID 3c1e3f4b-1e1a-43df-af28-59820e958e3c, vendor MTK, a single hardware error reported as corrected.

Every boot hits the same DOE mailbox sequence on 000f:01:00.0 at ~1.13s:

pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link

The PCIe link is degraded: 2.5GT/s x1 (downgraded) is confirmed in lspci from the bug report. BusMaster+ is still present, so the GPU partially enumerates and the OS can boot, but the link is not in a normal usable state.
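
For scale, the bandwidth figures printed in the kernel log can be reproduced from the link speed, the lane count and the line encoding. This is a sketch under the assumption that the kernel uses a per-lane rate in Mb/s with integer truncation (8b/10b encoding up to 5 GT/s, 128b/130b from 8 GT/s to 32 GT/s); `link_gbps` is a hypothetical helper name.

```python
# Per-lane data rate in Mb/s after line encoding, using integer arithmetic.
# Gen1/Gen2 use 8b/10b encoding; Gen3 to Gen5 use 128b/130b.
LANE_MBPS = {
    "2.5": 2500 * 8 // 10,
    "5.0": 5000 * 8 // 10,
    "8.0": 8000 * 128 // 130,
    "16.0": 16000 * 128 // 130,
    "32.0": 32000 * 128 // 130,
}


def link_gbps(speed_gt, width):
    """Usable bandwidth in Gb/s for a link at `speed_gt` GT/s with `width` lanes."""
    return LANE_MBPS[speed_gt] * width / 1000


print(link_gbps("2.5", 16))   # the "capable of 32.000 Gb/s" figure for 000f:01:00.0
print(link_gbps("32.0", 4))   # the 126.028 Gb/s figure reported by mlx5_core
print(link_gbps("2.5", 1))    # Gen1 x1, the current state reported by nvidia-smi
print(link_gbps("32.0", 16))  # Gen5 x16, the device maximum reported by nvidia-smi
```

Under these assumptions a Gen1 x1 link carries about 2 Gb/s, against roughly 504 Gb/s for Gen5 x16, which is the gap between the "Current" and "Device Max" values in the nvidia-smi output above.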

The SBSA generic watchdog is armed:

ACPI GTDT: found 1 SBSA generic Watchdog(s)

The repeatable ~20-minute reboot cadence is consistent with a watchdog or firmware timeout path acting on the degraded platform state.
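
One way to probe the watchdog hypothesis is to read the watchdog state the kernel exposes in sysfs. The sketch below assumes the running kernel was built with watchdog sysfs support, so that `/sys/class/watchdog/watchdogN/` contains attributes such as `identity`, `state`, `timeout` and `timeleft`; whether the DGX Spark kernel exposes them is not confirmed in this thread. `read_watchdogs` is a hypothetical helper name.

```python
from pathlib import Path


def read_watchdogs(root="/sys/class/watchdog"):
    """Collect identity/state/timeout/timeleft for each watchdog under `root`."""
    devices = []
    base = Path(root)
    if not base.is_dir():
        return devices  # no watchdog class directory on this system
    for dev in sorted(base.glob("watchdog*")):
        info = {"name": dev.name}
        for attr in ("identity", "state", "timeout", "timeleft"):
            try:
                info[attr] = (dev / attr).read_text().strip()
            except OSError:
                info[attr] = None  # attribute missing or not readable
        devices.append(info)
    return devices


for wd in read_watchdogs():
    print(wd)
```

If a device shows `state` as `active` with a `timeout` far shorter than 20 minutes and the system stays up past it, something is servicing that watchdog, and the ~20-minute cadence would more likely come from a firmware-side timer than from this OS-visible one. That inference is only a suggestion for narrowing things down.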

The mlx5 blacklist test was useful: the reboot interval did not change, which helps rule out the Mellanox path as the trigger. There are no Xid errors and no GSP fault chain in the collected data, so this does not currently look like a GPU runtime failure. The failure signature is at the PCIe / firmware enumeration layer.

The boot journal shows 8 consecutive boots hitting the same DOE abort sequence, so this is persistent rather than intermittent.
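
To repeat that per-boot check on other logs, the signatures discussed above can be counted mechanically. This is a minimal sketch: the regular expressions are taken from the log lines quoted in this thread, and the `SIGNATURES` table and `count_signatures` helper are hypothetical names.

```python
import re
from collections import Counter

# Patterns derived from kernel messages quoted in this thread.
SIGNATURES = {
    "doe_abort_timeout": re.compile(r"DOE: \[[0-9a-f]+\] ABORT timed out"),
    "doe_mailbox_failed": re.compile(r"DOE: \[[0-9a-f]+\] failed to create mailbox"),
    "x0_link": re.compile(r"limited by Unknown x0 link"),
    "bert_record": re.compile(r"BERT: Error records from previous boot"),
    "mlx5_low_power": re.compile(r"Detected insufficient power on the PCIe slot"),
}


def count_signatures(log_text):
    """Count how many lines of `log_text` match each known signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


SAMPLE = """\
2026-05-02T16:33:38-04:00 saispark kernel: BERT: Error records from previous boot:
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
2026-05-02T16:33:38-04:00 saispark kernel: pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
"""

counts = count_signatures(SAMPLE)
for name in SIGNATURES:
    print(f"{name}: {counts[name]}")
```

Feeding it the saved output of `journalctl -k -b N --no-pager -o short-iso` for each boot index N would show which boots carry the DOE abort sequence and which, if any, do not.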