Hey guys, I need your advice on making the DGX Spark more stable.
Right now it mostly crashes for one of two reasons.
NOT ENOUGH RAM
First, out of memory: the loaded model tries to allocate more RAM than the DGX Spark has. (I can see the DGX Dashboard filling the system memory up to 127.X GB, and then the system freezes.)
Is there a way to tell Linux to reserve RAM for the OS so it does not crash?
OVERHEAT SHUTDOWN
If the model (like Qwen3.6) loads successfully and is working on a longer, more difficult prompt or benchmark, I notice the ACPI temperature rising to 95°C (203°F) before the system reboots itself.
I have placed a big fan in front of the DGX Spark, but it looks like that is not enough. And summer is not here yet, so I guess the issue will become more frequent.
I know there is a whole thread about just this.
I’d love to get to a state where the Spark just runs and I can place it in the basement, “set and forget”, without having to go down there and push the power button every so often.
I have an out-of-the-box cooling idea, but it would almost certainly void the warranty. I think Comino might be able to provide a solution if there is enough demand and if users are willing to accept the warranty risk:
Comino also sells the DGX Spark, so they should already be familiar with the machine. Who knows — they might even be able to work with NVIDIA to offer a special liquid-cooled version of the DGX Spark.
Based on the teardown photos in these reviews, there appears to be enough internal space to replace the existing heatsink with a water block:
A new or modified enclosure would likely be needed to route the water-cooling tubes and support the external cooling loop.
My DGX started crashing too. I ran one Claude Code prompt against a large code repo just to give it a longer task (as a test); the GPU temps didn’t even get very high, and the node went down shortly after.
EDIT: I added some metric reporting on my Spark, and I can now see that TSOC (the chip as a whole) hits around 95°C and TS1P (some of the CPU performance cores) also hits around 95°C.
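For anyone who wants similar temperature reporting, here is a minimal Python sketch that reads the generic Linux thermal sysfs interface. It assumes the Spark exposes its sensors as thermal zones under /sys/class/thermal; that is not confirmed in this thread, and whether names like TSOC or TS1P appear there is also an assumption. The function names are made up for illustration.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zoneN:type": temperature in °C} for every readable thermal zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            # sysfs reports temperatures in millidegrees Celsius
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable on this system; skip it
        readings[f"{zone.name}:{name}"] = millideg / 1000.0
    return readings


def hottest(readings):
    """Return (name, temperature) of the hottest zone, or None if there are none."""
    if not readings:
        return None
    return max(readings.items(), key=lambda item: item[1])


if __name__ == "__main__":
    zones = read_thermal_zones()
    for name, temp in zones.items():
        print(f"{name}: {temp:.1f} °C")
    print("hottest:", hottest(zones))
```

Running this in a loop (or from cron) while a benchmark is active should show which zone climbs toward the 95°C mark first, if the zones are exposed at all.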
This might work, but I would expect it to reduce performance by roughly 10% or more. I can run a benchmark tonight and report back numbers for Qwen 3.6 on my cluster.
However, the process consistently crashes after a few runs, preventing completion of a full benchmark cycle.
The mitigation/workaround that @rafaelkallis mentioned above (locking the GPU clock) seems to be effective. Once nvidia-smi --lock-gpu-clocks 0,2150 is applied, I was able to successfully complete a benchmark run on my cluster (2 nodes). The results are below:
| model | test | t/s (avg ± std) | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| qwen-3.6-awq | pp2048 @ d4096 | 2096.43 ± 570.42 | | 2972.04 ± 688.61 | 2845.45 ± 688.61 | 2972.04 ± 688.61 |
| qwen-3.6-awq | tg32 @ d4096 | 24.03 ± 0.00 | 26.00 ± 0.00 | | | |
| qwen-3.6-awq | pp2048 @ d8192 | 1257.02 ± 0.39 | | 7494.41 ± 40.05 | 7367.82 ± 40.05 | 7494.41 ± 40.05 |
| qwen-3.6-awq | tg32 @ d8192 | 23.92 ± 0.17 | 26.00 ± 0.00 | | | |
| qwen-3.6-awq | pp2048 @ d16384 | 1319.11 ± 4.27 | | 12833.34 ± 142.70 | 12706.75 ± 142.70 | 12833.34 ± 142.70 |
| qwen-3.6-awq | tg32 @ d16384 | 23.38 ± 0.75 | 25.50 ± 0.50 | | | |
| qwen-3.6-awq | pp2048 @ d32768 | 1396.66 ± 55.64 | | 22738.21 ± 886.04 | 22611.62 ± 886.04 | 22738.21 ± 886.04 |
| qwen-3.6-awq | tg32 @ d32768 | 24.41 ± 0.16 | 25.50 ± 1.50 | | | |
| qwen-3.6-awq | pp2048 @ d65536 | 1267.68 ± 8.07 | | 48512.18 ± 400.18 | 48385.59 ± 400.18 | 48512.18 ± 400.18 |
| qwen-3.6-awq | tg32 @ d65536 | 21.46 ± 0.20 | 24.50 ± 0.50 | | | |
| qwen-3.6-awq | pp2048 @ d131072 | 1089.82 ± 9.53 | | 110838.49 ± 1288.67 | 110711.90 ± 1288.67 | 110838.49 ± 1288.67 |
| qwen-3.6-awq | tg32 @ d131072 | 20.91 ± 0.62 | 25.50 ± 0.50 | | | |
Based on this behavior alone, I believe that GPU clock behavior (and likely thermals) plays a significant role in system reliability under sustained load.
I have new thermal paste on order to see if it helps resolve the issue. In the meantime, underclocking or locking the GPU clocks appears to be a practical workaround for maintaining stability. While not an ideal long-term solution, it does allow the system to operate reliably enough for benchmarking (and, I assume, general use).
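A small sketch of applying and later undoing the clock lock, written in Python. It only builds and prints the commands rather than running them. --lock-gpu-clocks and --reset-gpu-clocks are existing nvidia-smi options, but the helper names are hypothetical, the 0,2150 range is simply the value quoted above, and the sudo prefix is an assumption about your setup.

```python
import shlex


def lock_gpu_clocks_cmd(min_mhz, max_mhz):
    """Build the nvidia-smi command that pins GPU clocks to [min_mhz, max_mhz]."""
    if not 0 <= min_mhz <= max_mhz:
        raise ValueError("expected 0 <= min_mhz <= max_mhz")
    # sudo is an assumption: changing clocks normally needs elevated privileges
    return ["sudo", "nvidia-smi", f"--lock-gpu-clocks={min_mhz},{max_mhz}"]


def reset_gpu_clocks_cmd():
    """Build the nvidia-smi command that removes a previous clock lock."""
    return ["sudo", "nvidia-smi", "--reset-gpu-clocks"]


if __name__ == "__main__":
    # Print the commands instead of executing them, so nothing changes by accident.
    print(shlex.join(lock_gpu_clocks_cmd(0, 2150)))
    print(shlex.join(reset_gpu_clocks_cmd()))
```

Keeping the reset command next to the lock command makes it easier to compare benchmark numbers with and without the cap.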
NOT ENOUGH RAM
"I’m running into the out-of-memory freezing issue again and again.
When the memory usage climbs to 126.5GB and leaves about 1.5GB free, the Linux desktop completely locks up.
Has anyone had success adjusting sysctl limits to reserve 2GB strictly for the OS? I just want to keep the display and peripherals responsive instead of the machine freezing. If anyone has a better workaround for memory reservation, I’d love to hear it."
I will try adjusting vm.min_free_kbytes to force the Linux kernel to keep a specific amount of RAM free, so that the OS and the peripherals don’t lock up.
First I will try it temporarily and reserve 2 GB (2097152 kB).
Open a terminal and run: sudo sysctl -w vm.min_free_kbytes=2097152
(To make it persist after a reboot, add vm.min_free_kbytes = 2097152 to your /etc/sysctl.conf file.)
That did not help; it is still crashing.
Is there a command to check whether I really have 128 GB of working RAM and not 126.5 GB? Maybe the RAM is defective?
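One way to check what the kernel actually sees is to read /proc/meminfo. Below is a minimal Python sketch; the helper names are illustrative. Keep in mind that MemTotal reports usable RAM after firmware and kernel reservations, so a value somewhat below the nominal 128 GB does not by itself point to defective memory. The "kB" in /proc/meminfo means units of 1024 bytes.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value} (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def kib_to_gib(kib):
    """Convert a /proc/meminfo kB value (1024-byte units) to GiB."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
    except OSError:
        info = {}  # not on Linux, or /proc is not mounted
    for field in ("MemTotal", "MemAvailable", "SwapTotal"):
        if field in info:
            print(f"{field}: {kib_to_gib(info[field]):.2f} GiB")
```

Comparing MemTotal right after boot with the figure the dashboard shows just before a freeze gives a rough idea of how much headroom is really left.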
I just disable swap. You get an instant OOM kill, but that’s better than the whole machine hanging and having to power it off and on, especially when working remotely.
To reserve memory for the kernel on a UMA system, the best option would be to use the movablecore boot option to create a safe zone for the OS.
The Linux kernel caches aggressively (file reads, for example), and the page cache takes memory away from your LLM workloads, so adjusting vfs_cache_pressure from its default setting might help:
sudo sysctl -w vm.vfs_cache_pressure=200
This would make the kernel give up its caches more readily. The default is 100.
I set swappiness to 0 (I had it at 10, but lowered it to 0 yesterday!)
- Check current value: cat /proc/sys/vm/swappiness
- Temporarily set to 0: sudo sysctl vm.swappiness=0
Then, if no issues are found:
- Permanently set: add vm.swappiness=0 to /etc/sysctl.conf.
This makes the system use swap only when memory is 100% full, not before. It saves me from a crash, and the “lock-ups” from filling the swap are minimal, since it only writes what it needs; you can see it slowing down and catch it in time.
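To double-check which of the vm settings discussed in this thread are actually in effect, a short Python sketch like the following can read them back from /proc/sys/vm. The function names are hypothetical, and the printed lines are only a suggestion for what could go into /etc/sysctl.conf, not a recommended configuration.

```python
from pathlib import Path

# The three tunables mentioned in this thread
VM_KEYS = ("swappiness", "vfs_cache_pressure", "min_free_kbytes")


def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {key: current integer value} for each readable vm tunable."""
    settings = {}
    for key in keys:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            pass  # tunable not present or not readable on this kernel
    return settings


def sysctl_conf_lines(settings):
    """Format settings the way they would appear in /etc/sysctl.conf."""
    return [f"vm.{key} = {value}" for key, value in settings.items()]


if __name__ == "__main__":
    for line in sysctl_conf_lines(read_vm_settings()):
        print(line)
```

Running it before and after a reboot shows whether a change made with sysctl -w was also persisted.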
Removing the device housing and adding extra cooling seems to be about the only practical solution. I use an ASUS GX10, and even without any additional cooling, it maintains stable temperatures below 70°C during full-load workloads lasting several hours to several dozen hours. The only condition that causes throttling is when both the GPU and CPU are under full load at the same time. At least on the GX10, GPU core cooling is excellent, but CPU cooling is somewhat lacking.
I stand my GX10 on its side and it seems to help. I don’t understand why Asus decided to place the air intake at the bottom of the device with just a few mm of clearance from the desk. Wouldn’t that affect the airflow? The MSI EdgeXpert design is much more logical: suck air in from the front and vent it out the back. Simple and efficient.
And removing the dust from the front is probably also a good idea.
I had my Spark facing the external fan, with the back side toward me (for easier access to the power button).
I can confirm this has been behaving well while running some agents for testing. swappiness=1 allows me to push memory to the limit, and if things just trickle over into swap, it’s all OK. Thanks @azampatti
I also have mine on their sides, with plenty of airflow and occasionally blast the front grille with compressed air cleaner.