DGX Spark stability / out of RAM / overheating

martinB78 · April 30, 2026, 6:13am

Hey guys I need your advice to get the DGX Spark more stable.

For now most of the time it crashes for one of two reasons.

NOT ENOUGH RAM
1st. Out of memory: the model being loaded tries to allocate more RAM than the DGX Spark has. (I can see the system memory in the DGX Dashboard fill up to 127.X GB, and then the system freezes.)

I tried to update my stack so it calculates the right gpu_memory_utilization to match the free RAM (Step 6 - Dynamic VRAM Launcher for Large Models), but it looks like it does not work right yet.

[auto-gmem] cfg: layers=40 kv_heads=2 head_dim=256 ctx=262144 batch=10 kvb=1 kv_batch_used=4
[auto-gmem] weights=29.40GiB kv=40.00GiB safety=4GiB → need=73.40GiB
[auto-gmem] free=90.39GiB total=115.39GiB (CUDA view) → gpu_memory_utilization=0.64 [sized]
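
For anyone following along, the arithmetic in that log can be reproduced with a few lines of Python. This is only a sketch of what the launcher appears to compute, not its actual code: it assumes kvb=1 means one byte per KV element (for example an FP8 KV cache) and that kv_batch_used=4 means four full-context sequences are budgeted. The function names are made up for illustration.

```python
GIB = 1024 ** 3

def kv_cache_gib(layers, kv_heads, head_dim, ctx, bytes_per_elem, sequences):
    """KV cache size in GiB: one K and one V vector per layer, KV head, and token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * ctx * sequences / GIB

def gpu_memory_utilization(weights_gib, kv_gib, safety_gib, total_gib):
    """Return (required GiB, fraction of total memory rounded to two decimals)."""
    need = weights_gib + kv_gib + safety_gib
    return need, round(need / total_gib, 2)

# Values taken from the [auto-gmem] log lines above.
# Assumption: kvb=1 is bytes per element, kv_batch_used=4 is the sequence count.
kv = kv_cache_gib(layers=40, kv_heads=2, head_dim=256, ctx=262144,
                  bytes_per_elem=1, sequences=4)
need, util = gpu_memory_utilization(29.40, kv, 4.0, 115.39)
print(f"kv={kv:.2f}GiB need={need:.2f}GiB gpu_memory_utilization={util:.2f}")
```

Under those assumptions the numbers match the log (40 GiB of KV cache, 73.40 GiB needed, 0.64), so the sizing step itself looks internally consistent. On a unified-memory machine the free figure measured at launch can still shrink afterwards, because the page cache and other processes draw from the same pool.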

Is there a way to tell Linux to reserve RAM for the OS so it does not crash?

OVERHEAT SHUTDOWN
If the model (like Qwen3.6) loads successfully and is working on a longer, more difficult prompt or a benchmark, I notice the ACPI temperature rising to 95°C (203°F) before the system reboots itself.

I have placed a big fan in front of the DGX Spark, but it looks like that is not enough. And summer is not here yet, so I guess this issue will become more frequent.

I know there is a whole thread about just this.

I’d love to get to a state where the Spark just runs and I can place it in the basement, “set and forget”, without having to go down there and push the power button every so often.

Thanks guys

paulsc.liu · April 30, 2026, 7:04am

I have an out-of-the-box cooling idea, but it would almost certainly void the warranty. I think Comino might be able to provide a solution if there is enough demand and if users are willing to accept the warranty risk:

Comino also sells the DGX Spark, so they should already be familiar with the machine. Who knows — they might even be able to work with NVIDIA to offer a special liquid-cooled version of the DGX Spark.

Based on the teardown photos in these reviews, there appears to be enough internal space to replace the existing heatsink with a water block:

A new or modified enclosure would likely be needed to route the water-cooling tubes and support the external cooling loop.

giles8 · April 30, 2026, 8:16am

Or maybe just swap it out for an OEM that has better cooling…

https://ipc.msi.com/blog_detail/unlocking-potential-why-is-msi-edgexperts-ai-performance-faster-despite-using-the-same-grace-blackwell-architecture

rafaelkallis · April 30, 2026, 1:03pm

@martinB78 no more overheating for me with this one (needs sudo)

nvidia-smi --lock-gpu-clocks 0,2150

azampatti · April 30, 2026, 3:38pm

Did you run benchmarks before and after that command? I wonder what the performance tradeoffs are.

jan.fe · April 30, 2026, 3:50pm

I’m using the lenovo variant, so far rock stable, no restarts due to overheating, even with sustained heavy workloads.

fishnotphish · April 30, 2026, 4:07pm

My DGX started crashing too. I ran one Claude Code prompt against a large code repo just to give it a longer task (as a test); the GPU temps didn’t even get very high, and the node went down shortly after.

EDIT: I threw some metric reporting on my Spark and I can now see the TSOC (the chip as a whole) hits around 95C and the TS1P (some of the CPU performance cores) also hits around 95C.
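
In case it helps others collect the same kind of readings, here is a minimal sketch that polls the standard Linux thermal sysfs interface. It assumes the sensors are exposed under /sys/class/thermal on your unit; whether zones named TSOC or TS1P show up there is something I have not verified, and the function names are just for illustration.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": temperature in °C} for each readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            # The kernel reports temperatures in millidegrees Celsius.
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone vanished or the sensor is not readable right now
        readings[f"{zone.name}:{name}"] = millideg / 1000.0
    return readings

def hottest(readings):
    """Return the (zone, temperature) pair with the highest reading, or None."""
    return max(readings.items(), key=lambda item: item[1]) if readings else None

if __name__ == "__main__":
    for zone, temp in read_thermal_zones().items():
        print(f"{zone:30s} {temp:6.1f} °C")
```

Running it in a loop during a benchmark (for example once per second, appended to a log file) should show which zone climbs toward 95°C before the node goes down.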

fishnotphish · May 1, 2026, 12:27am

This might work, but I would expect it to reduce performance by roughly 10% or more. I can run a benchmark tonight and report back numbers for Qwen 3.6 on my cluster.

fishnotphish · May 1, 2026, 1:48am

I’ve been attempting to benchmark using:

llama-benchy --base-url <url> --model <model> --depth 4096 8192 16384 32768 65536 131072 --latency-mode generation

However, the process consistently crashes after a few runs, preventing completion of a full benchmark cycle.

The mitigation/workaround that @rafaelkallis mentioned above (locking the GPU clock) seems to be effective. Once nvidia-smi --lock-gpu-clocks 0,2150 is applied, I was able to successfully complete a benchmark run on my cluster (2 nodes). The results are below:

| model        | test             | t/s (avg ± std)  | peak t/s     | ttfr (ms)           | est_ppt (ms)        | e2e_ttft (ms)       |
|:-------------|-----------------:|-----------------:|-------------:|--------------------:|--------------------:|--------------------:|
| qwen-3.6-awq |  pp2048 @ d4096  | 2096.43 ± 570.42 |              |    2972.04 ± 688.61 |    2845.45 ± 688.61 |    2972.04 ± 688.61 |
| qwen-3.6-awq |    tg32 @ d4096  |     24.03 ± 0.00 | 26.00 ± 0.00 |                     |                     |                     |
| qwen-3.6-awq |  pp2048 @ d8192  |   1257.02 ± 0.39 |              |     7494.41 ± 40.05 |     7367.82 ± 40.05 |     7494.41 ± 40.05 |
| qwen-3.6-awq |    tg32 @ d8192  |     23.92 ± 0.17 | 26.00 ± 0.00 |                     |                     |                     |
| qwen-3.6-awq | pp2048 @ d16384  |   1319.11 ± 4.27 |              |   12833.34 ± 142.70 |   12706.75 ± 142.70 |   12833.34 ± 142.70 |
| qwen-3.6-awq |   tg32 @ d16384  |     23.38 ± 0.75 | 25.50 ± 0.50 |                     |                     |                     |
| qwen-3.6-awq | pp2048 @ d32768  |  1396.66 ± 55.64 |              |   22738.21 ± 886.04 |   22611.62 ± 886.04 |   22738.21 ± 886.04 |
| qwen-3.6-awq |   tg32 @ d32768  |     24.41 ± 0.16 | 25.50 ± 1.50 |                     |                     |                     |
| qwen-3.6-awq | pp2048 @ d65536  |   1267.68 ± 8.07 |              |   48512.18 ± 400.18 |   48385.59 ± 400.18 |   48512.18 ± 400.18 |
| qwen-3.6-awq |   tg32 @ d65536  |     21.46 ± 0.20 | 24.50 ± 0.50 |                     |                     |                     |
| qwen-3.6-awq | pp2048 @ d131072 |   1089.82 ± 9.53 |              | 110838.49 ± 1288.67 | 110711.90 ± 1288.67 | 110838.49 ± 1288.67 |
| qwen-3.6-awq |   tg32 @ d131072 |     20.91 ± 0.62 | 25.50 ± 0.50 |                     |                     |                     |

Based on this behavior, I believe that GPU clock behavior (and likely thermals) plays a significant role in system reliability under sustained load.

I have new thermal paste on order to see if it will help resolve the issue. In the meantime, underclocking or locking the GPU clocks appears to be a practical workaround for maintaining stability. While not an ideal long-term solution, it does allow the system to operate reliably enough for benchmarking (and, I assume, general use).

rafaelkallis · May 1, 2026, 8:29am

Here you go.

--lock-gpu-clocks 0,2150

| model             |         test |      t/s (total) |       t/s (req) |       peak t/s |   peak t/s (req) |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------------|-------------:|-----------------:|----------------:|---------------:|-----------------:|------------------:|------------------:|------------------:|
| Qwen3.5-122B-A10B |  pp2048 (c1) |  2294.64 ± 32.92 | 2294.64 ± 32.92 |                |                  |    982.22 ± 12.86 |    893.28 ± 12.86 |    982.22 ± 12.86 |
| Qwen3.5-122B-A10B |   tg128 (c1) |     28.75 ± 0.04 |    28.75 ± 0.04 |   33.00 ± 0.00 |     33.00 ± 0.00 |                   |                   |                   |
| Qwen3.5-122B-A10B |  pp2048 (c4) | 2497.45 ± 166.66 |  654.64 ± 46.50 |                |                  |  3235.59 ± 235.15 |  3146.65 ± 235.15 |  3235.59 ± 235.15 |
| Qwen3.5-122B-A10B |   tg128 (c4) |     68.81 ± 1.75 |    17.97 ± 0.63 |   90.00 ± 1.41 |     22.83 ± 0.99 |                   |                   |                   |
| Qwen3.5-122B-A10B | pp2048 (c16) |  2848.79 ± 36.54 |   179.86 ± 2.42 |                |                  | 11483.08 ± 154.82 | 11394.14 ± 154.82 | 11483.08 ± 154.82 |
| Qwen3.5-122B-A10B |  tg128 (c16) |    185.35 ± 5.17 |    12.39 ± 0.52 | 330.33 ± 14.82 |     20.75 ± 0.99 |                   |                   |                   |

llama-benchy (0.3.7)

--reset-gpu-clocks

| model             |         test |      t/s (total) |       t/s (req) |       peak t/s |   peak t/s (req) |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------------|-------------:|-----------------:|----------------:|---------------:|-----------------:|------------------:|------------------:|------------------:|
| Qwen3.5-122B-A10B |  pp2048 (c1) |  2324.63 ± 18.38 | 2324.63 ± 18.38 |                |                  |     966.39 ± 6.99 |     881.63 ± 6.99 |     966.39 ± 6.99 |
| Qwen3.5-122B-A10B |   tg128 (c1) |     30.77 ± 0.58 |    30.77 ± 0.58 |   34.00 ± 0.00 |     34.00 ± 0.00 |                   |                   |                   |
| Qwen3.5-122B-A10B |  pp2048 (c4) | 2533.83 ± 218.48 |  665.11 ± 61.28 |                |                  |  3193.47 ± 305.88 |  3108.71 ± 305.88 |  3193.47 ± 305.88 |
| Qwen3.5-122B-A10B |   tg128 (c4) |     72.99 ± 3.10 |    19.02 ± 0.87 |   94.33 ± 1.70 |     24.17 ± 1.07 |                   |                   |                   |
| Qwen3.5-122B-A10B | pp2048 (c16) |  2933.42 ± 48.59 |   185.20 ± 3.29 |                |                  | 11151.95 ± 196.02 | 11067.19 ± 196.02 | 11151.95 ± 196.02 |
| Qwen3.5-122B-A10B |  tg128 (c16) |    186.76 ± 4.87 |    12.47 ± 0.50 | 323.33 ± 20.50 |     20.29 ± 1.34 |                   |                   |                   |

llama-benchy (0.3.7)

rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm
eugr’s nightly vllm image
Asus GX10
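
To put a number on the tradeoff, the two tables above can be compared directly. The sketch below only copies the mean "t/s (total)" values from those tables and computes the relative change; it ignores the reported standard deviations, so treat small differences with caution.

```python
# Mean t/s (total) from the two tables above:
# (locked with --lock-gpu-clocks 0,2150, default after --reset-gpu-clocks)
results = {
    "pp2048 (c1)":  (2294.64, 2324.63),
    "tg128 (c1)":   (28.75, 30.77),
    "pp2048 (c4)":  (2497.45, 2533.83),
    "tg128 (c4)":   (68.81, 72.99),
    "pp2048 (c16)": (2848.79, 2933.42),
    "tg128 (c16)":  (185.35, 186.76),
}

def percent_change(locked, default):
    """Relative throughput change of the locked run versus default clocks, in percent."""
    return (locked - default) / default * 100.0

changes = {test: percent_change(*pair) for test, pair in results.items()}
for test, delta in changes.items():
    print(f"{test:13s} {delta:+.1f}%")
```

On these numbers the locked run is slower in every row, by somewhere between under 1% and about 6.6%, with single-stream token generation (tg128, c1) taking the largest hit. Several of the gaps are smaller than the run-to-run spread in the tables, so this is a single data point from one machine, not a general result.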

martinB78 · May 1, 2026, 9:31am

NOT ENOUGH RAM
I’m running into the out-of-memory freeze again and again.
When memory usage climbs to 126.5GB, leaving about 1.5GB free, the Linux desktop completely locks up.

Has anyone had success adjusting sysctl limits to reserve 2GB strictly for the OS? I just want to keep the display and peripherals responsive instead of the machine freezing. If anyone has a better workaround for memory reservation, I’d love to hear it.

I will try adjusting vm.min_free_kbytes, which tells the Linux kernel to keep a specific amount of RAM free so that the OS and the peripherals don’t lock up.
First I will try it temporarily and reserve 2GB (2097152 KB):

Open a terminal and run: sudo sysctl -w vm.min_free_kbytes=2097152

(To make it persist after a reboot, add vm.min_free_kbytes = 2097152 to your /etc/sysctl.conf file).

Did not help - still crashing.

Is there a command to check whether I really have 128GB of working RAM and not 126.5GB? Maybe the RAM is defective?
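
One way to check what the kernel actually sees is to read /proc/meminfo. Below is a minimal sketch; the helper names are made up for illustration. Note that MemTotal is always somewhat smaller than the installed RAM, because firmware and the kernel reserve part of it, and that 128 GB in decimal units is only about 119.2 GiB, so a lower figure does not by itself mean the RAM is faulty.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: int}; values are KiB where the file says kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields

def kib_to_gib(kib):
    """Convert KiB to binary gibibytes (2**30 bytes)."""
    return kib / (1024 ** 2)

def kib_to_gb(kib):
    """Convert KiB to decimal gigabytes (10**9 bytes)."""
    return kib * 1024 / 1e9

def report(path="/proc/meminfo"):
    """Print total and available memory, plus swap size, in both unit systems."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    for key in ("MemTotal", "MemAvailable", "SwapTotal"):
        if key in info:
            kib = info[key]
            print(f"{key:13s} {kib_to_gib(kib):8.2f} GiB  {kib_to_gb(kib):8.2f} GB")
```

Calling report() on the Spark prints the figures side by side, which makes it easier to tell whether the dashboard and the CUDA view are simply using different units.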

wentbackward · May 1, 2026, 10:29am

I just disable swap. You get an instant OOM kill, but that’s better than the whole machine hanging and having to power-cycle it, especially when working remotely.

Not a solution, but it’s better.

sudo swapoff -a

elsaco · May 1, 2026, 2:47pm

To reserve memory for the kernel on a UMA system, the best option would be to use the movablecore boot option to create a safe zone for the OS.

The Linux kernel caches aggressively (file reads, for example), and those caches compete with your LLM workloads for memory, so raising vm.vfs_cache_pressure from its default might help:

sudo sysctl -w vm.vfs_cache_pressure=200

This makes the kernel give up its dentry and inode caches more readily. The default is 100.

azampatti · May 1, 2026, 3:02pm

I set swappiness to 0 (I had it at 10, but lowered it to 0 yesterday!)

- Check the current value: cat /proc/sys/vm/swappiness
- Temporarily set to 0: sudo sysctl vm.swappiness=0
Then, if no issues are found:
- Permanently set: add vm.swappiness=0 to /etc/sysctl.conf.

With this setting, swap is only used when memory is essentially full, not before. It saves me from a crash, and the “lock-ups” from filling the swap are minimal, since it only writes what it needs; you can see it slowing down and catch it in time.
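
Since several vm.* settings have come up in this thread, here is a small sketch for checking what is currently active before changing anything. It only reads from /proc/sys/vm and formats lines you could add to /etc/sysctl.conf yourself; it writes nothing. The function names are hypothetical.

```python
from pathlib import Path

# The three tunables discussed in this thread.
TUNABLES = ("swappiness", "vfs_cache_pressure", "min_free_kbytes")

def read_vm_tunables(base="/proc/sys/vm", names=TUNABLES):
    """Return {name: current integer value, or None if it cannot be read}."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values

def sysctl_conf_lines(desired):
    """Format desired settings as /etc/sysctl.conf lines (nothing is written)."""
    return [f"vm.{name} = {value}" for name, value in desired.items()]

if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"vm.{name} = {value}")
```

For example, sysctl_conf_lines({"swappiness": 0}) returns the single line vm.swappiness = 0, which is the persistent form of the temporary command above.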

ska0982 · May 2, 2026, 9:08am

Removing the device housing and adding extra cooling seems to be about the only practical solution. I use an ASUS GX10, and even without any additional cooling, it maintains stable temperatures below 70°C during full-load workloads lasting several hours to several dozen hours. The only condition that causes throttling is when both the GPU and CPU are under full load at the same time. At least on the GX10, GPU core cooling is excellent, but CPU cooling is somewhat lacking.

snoop54088 · May 2, 2026, 9:30am

I stand my GX10 on its side and it seems to help. I don’t understand why Asus decided to place the air intake at the bottom of the device with just a few mm of clearance from the desk. Wouldn’t that affect the airflow? The MSI EdgeXpert design is much more logical: suck air in from the front and vent it out the back. Simple and efficient.

mashie · May 2, 2026, 10:17am

With that design you do get the easy option to place a 120x120mm fan between the GX10 and the desk.

paulsc.liu · May 2, 2026, 1:59pm

More extreme cooling technique:

or buy from Aliexpress
https://www.aliexpress.com/item/1005012027879003.html

martinB78 · May 2, 2026, 8:34pm

Good idea, will give that a try too.

And removing the dust from the front is probably also a good idea.
I had my Spark’s front facing the external fan, so I was only ever looking at the back (for easier access to the power button).

wentbackward · May 4, 2026, 11:46am

I can confirm this has been behaving well while running some agents for testing. swappiness=1 allows me to push memory to the limit, and if things just trickle over into swap, it’s all OK. Thanks @azampatti

I also have mine on their sides, with plenty of airflow and occasionally blast the front grille with compressed air cleaner.
