Crashing every few minutes

My DGX Spark has started crashing every few minutes. I was on the 6.14 kernel and, after reading similar posts on the forums, tried 6.17 and the latest firmware updates. No change in behavior.

I downloaded and installed the field diagnostics software and tried multiple times to complete the tests, but the system crashes every time, at a different point in each run. I have the logs ready to upload.

Please DM me the logs

Is it really crashing, or just rebooting for no apparent reason? I was seeing similar behavior on one of my Sparks, and it turned out the watchdog module was not loading, which forced the box to reboot. You may want to do a quick spot check to see whether you are in the same position by running this:

lsmod | grep sbsa_gwdt

You should see the output below. If the output is empty, you are probably running into the same problem.
lsmod | grep sbsa_gwdt
sbsa_gwdt 20480 1

Easy to fix though.
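
For anyone who wants to script the spot check above, here is a minimal sketch. The helper name `has_sbsa_gwdt` is made up for illustration. The fix shown in the comments (loading the module with `modprobe` and listing it under `/etc/modules-load.d/`) is an assumption about what the "easy fix" is on a systemd-based install. It is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if lsmod-style text on stdin
# contains a line whose first field is exactly "sbsa_gwdt".
has_sbsa_gwdt() {
    awk '$1 == "sbsa_gwdt" { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   if lsmod | has_sbsa_gwdt; then echo "loaded"; else echo "missing"; fi
#
# Assumed fix if the module is missing (not confirmed in this thread):
#   sudo modprobe sbsa_gwdt
#   echo sbsa_gwdt | sudo tee /etc/modules-load.d/sbsa_gwdt.conf
```

Matching on the first field avoids false positives from other modules that merely list `sbsa_gwdt` as a dependency in a later column.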

Thanks for the tip, kosta. I looked into it and the module wasn’t loaded. The system is stable for now, but we’ll see how it goes.

The BIOS offers an option to disable the watchdog; see the Advanced -> Watchdog Timer setting. The watchdog is controlled by the firmware:

elsaco@spark1:~$ sudo wdctl
[sudo] password for elsaco: 
Device:        /dev/watchdog0
Identity:      SBSA Generic Watchdog [version 0]
Timeout:       10 seconds
Timeleft:      1266874887 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION                   STATUS BOOT-STATUS
CARDRESET      Card previously reset the CPU      0           0
KEEPALIVEPING  Keep alive ping reply              1           0
MAGICCLOSE     Supports magic close char          0           0
SETTIMEOUT     Set timeout (in seconds)           0           0

Notice the Timeleft counter: roughly 40 years!
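
The 40-year figure can be checked with a bit of shell arithmetic. This is just a quick sketch, assuming a 365-day year; the variable names are my own.

```shell
#!/bin/sh
# Timeleft value copied from the wdctl output above.
timeleft=1266874887
# Seconds in a year, assuming 365 days (ignores leap years).
secs_per_year=$((365 * 24 * 3600))
# Integer division gives whole years.
years=$((timeleft / secs_per_year))
echo "$years"   # prints 40
```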

Since these are development systems and are expected to break, disabling the watchdog might be beneficial: you get a chance to find out what crashed instead of the box just rebooting.