My DGX Spark has started crashing every few minutes. I was on the 6.14 kernel and, after reading similar posts on the forums, tried 6.17 and the latest firmware updates. No change in behavior.
I downloaded and installed the field diagnostics software and tried multiple times to complete the tests, but it crashes every time at a different point. I have the logs ready to upload.
Is it really crashing or just rebooting for no reason? I was experiencing similar behavior on one of my Sparks, and it turned out the watchdog module was not loading, which forced the box to reboot. You may want to do a quick spot check to see if you are in the same position by running this:
lsmod | grep sbsa_gwdt
You should see output like the following. If it comes back empty, you are probably running into the same problem.
lsmod | grep sbsa_gwdt
sbsa_gwdt 20480 1
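The spot check above can be wrapped in a small helper, as a rough sketch. The function name `module_loaded` is made up for illustration. It only looks at the first column of `lsmod`-style text, so it would also report "not listed" if the driver were built into the kernel rather than loaded as a module.

```shell
# Sketch: report whether a kernel module name appears in lsmod-style output.
# The listing is read from stdin, so saved output can be checked as well.
# module_loaded is a hypothetical helper name, not part of any tool.
module_loaded() {
    # lsmod prints the module name in the first column.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Live check, only attempted when lsmod is available on this machine.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_loaded sbsa_gwdt; then
        echo "sbsa_gwdt is loaded"
    else
        echo "sbsa_gwdt is not listed by lsmod"
    fi
fi
```

Matching the first column exactly avoids false hits from other modules whose names merely contain the same string.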
The BIOS offers an option to disable the watchdog; see the Advanced -> Watchdog Timer setting. The watchdog is controlled by the firmware:
elsaco@spark1:~$ sudo wdctl
[sudo] password for elsaco:
Device: /dev/watchdog0
Identity: SBSA Generic Watchdog [version 0]
Timeout: 10 seconds
Timeleft: 1266874887 seconds
Pre-timeout: 0 seconds
FLAG           DESCRIPTION                    STATUS  BOOT-STATUS
CARDRESET      Card previously reset the CPU       0            0
KEEPALIVEPING  Keep alive ping reply               1            0
MAGICCLOSE     Supports magic close char           0            0
SETTIMEOUT     Set timeout (in seconds)            0            0
Notice the roughly 40-year Timeleft counter!
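For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. Both function names are hypothetical, and the conversion assumes plain 365-day years. `timeleft_seconds` assumes the `Timeleft:` line has the same layout as in the wdctl output above.

```shell
# Pull the Timeleft value (in seconds) out of wdctl-style output on stdin.
# Assumes a line of the form "Timeleft: <number> seconds".
timeleft_seconds() {
    awk '$1 == "Timeleft:" { print $2; exit }'
}

# Convert seconds to whole years and days, assuming 365-day years.
seconds_to_years_days() {
    secs="$1"
    year=$(( 365 * 24 * 3600 ))
    years=$(( secs / year ))
    days=$(( (secs % year) / 86400 ))
    echo "${years} years, ${days} days"
}

# Value taken from the wdctl output above.
seconds_to_years_days 1266874887
```

On a live system the two can be chained, for example `seconds_to_years_days "$(sudo wdctl | timeleft_seconds)"`.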
Since these are development systems and are expected to break, disabling the watchdog might help you find out what actually crashed instead of having the box simply reboot.