I was able to run some demo code. However, every 20-30 minutes the DGX Spark reboots on its own, even though I have no workload running on it.
Any way that we can fix this? This does not seem to be normal.
Can anyone confirm whether any of the following AI-suggested ideas would work?
Because your system is likely in a bad state from the failed update, the official solution is to reflash the system using a recovery image.
Get the Recovery Media: You will need to download the official DGX Spark recovery image from NVIDIA. This is enterprise hardware, so you should find this on your NVIDIA Enterprise Support portal or on the DGX Spark product support page.
Follow NVIDIA’s Instructions: The NVIDIA Developer Forums (specifically the “DGX Spark / GB10” section) have threads on this. The official guidance is to reflash the system to restore it to a clean, working state.
Contact NVIDIA Support: This is an expensive, enterprise-grade machine. Your fastest and safest path to a solution is to contact NVIDIA Enterprise Support directly. They are aware of this specific “boot loop” issue and will have the exact instructions and files you need.
I have had a lot of crashes. The DGX Spark crashes in a lot of different ways when you run out of memory. Often I will get a hard lock, where the mouse cursor just isn’t responsive. Other times, I will come back to a setup that has rebooted completely to the original user login.
If you run DGX Dashboard, do you see your memory usage climbing? Even if your setup is idle, out-of-control Docker containers running nothing but health checks can cause problems if they aren’t constrained.
If you don’t have that issue, reinstalling from scratch is a good option.
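If you want a record that survives the reboot, rather than watching the dashboard, a small logger can write available memory to disk at intervals. This is only a minimal sketch: it assumes a standard Linux `/proc/meminfo`, and the function names, the log path and the 30-second interval are made up for illustration.

```python
import os
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields: dict[str, int]) -> float:
    """Fraction of total RAM that the kernel reports as available."""
    return fields["MemAvailable"] / fields["MemTotal"]


def log_memory(log_path: str, interval_s: float = 30.0, samples: int = 120) -> None:
    """Append one timestamped line per sample, forcing each line to disk
    so the last reading before an unexpected reboot is not lost."""
    with open(log_path, "a") as log:
        for _ in range(samples):
            fields = parse_meminfo(Path("/proc/meminfo").read_text())
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} available={available_fraction(fields):.1%}\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_memory("mem_watch.log")` and leaving it running until the next reboot should show whether the available fraction was dropping beforehand. If it stays flat, memory pressure is probably not the cause.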
The DGX Spark is too new and beyond the training cutoff date for all LLMs. The boot loop was an issue in the first two days after release, where you couldn’t boot into the system at all. It is not the rebooting scenario you describe.
@kylezheng04 Just a thought… If this is something happening at a deep system level you may want to have a look at log files in /var/log. Some of these files might shed some light on what’s causing the system to crash/reboot.
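As a rough starting point for going through the logs, the sketch below pulls the kernel messages from the previous boot with `journalctl -k -b -1` and keeps only the lines that match a few keywords. It assumes the journal is persistent across reboots. The keyword list is a guess, not a list of known DGX Spark messages.

```python
import re
import subprocess

# Hypothetical keyword list; adjust it to whatever your logs actually contain.
SUSPECT = re.compile(r"out of memory|oom-kill|watchdog|panic|thermal", re.IGNORECASE)


def suspect_lines(lines):
    """Return the log lines that match any of the suspect keywords."""
    return [ln.rstrip("\n") for ln in lines if SUSPECT.search(ln)]


def previous_boot_kernel_log():
    """Kernel messages from the previous boot (needs a persistent journal)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()
```

Running `suspect_lines(previous_boot_kernel_log())` after an unexpected reboot gives a short list to read through. If nothing matches, the last few lines of the previous boot's log are still worth a look, since a hard reset may leave no error message at all.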
Easy way: Use the recovery media to reflash it. I had to do this a few times. Each time, make sure to check the DGX Dashboard and install whatever updates are available, then do
Hard way: if you want to troubleshoot and don’t want it to restart automatically, reboot and enter the BIOS/UEFI by pressing the DEL key repeatedly, then go to Advanced → Advanced → Watchdog. The watchdog automatically restarts the DGX when it detects that something is wrong. I think “something is wrong” is subjective. Still, turn it off at your own risk!
What worked was creating a fresh recovery image on a clean Ubuntu 24.04 desktop PC and then flashing that recovery system back onto the DGX Spark’s internal SSD. This fixed the problem.
I was using the 1016 Linux kernel, which had the 30-minute reboot issue. Then I switched to 1014 and had the same issue, so I conclude that this would not work.
```
Verify the running kernel with uname -r. If the system is on a kernel version known to cause instability (e.g., 1016), revert to a previous stable kernel or update to the latest DGX OS kernel via the GRUB menu or by reinstalling the latest DGX OS image.
```
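To confirm which kernel actually booted after switching in GRUB, the release string can be read from Python as well; `platform.release()` returns the same string as `uname -r`. The sketch below assumes an Ubuntu-style release string of the form `<version>-<build>-<flavour>`, where numbers like 1016 and 1014 would be the build part. That format is an assumption, and the helper names are made up.

```python
import platform
import re


def build_number(release: str):
    """Return the number after the first '-' in an Ubuntu-style release
    string, or None if the string does not have that shape (assumed format)."""
    match = re.match(r"^\d+\.\d+\.\d+-(\d+)", release)
    return int(match.group(1)) if match else None


def running_build():
    """Build number of the kernel that is running right now."""
    return build_number(platform.release())
```

Comparing `running_build()` before and after a kernel switch shows whether the change really took effect. Since the reboots happened on both 1016 and 1014, this only rules the kernel build in or out; it does not fix anything by itself.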