My units started crashing consistently after 1-2 benchmark runs. I began to suspect poor thermal paste coverage on the GPU/CPU, so I opened them up to look. I can confirm that the factory-applied thermal paste appears degraded and cracked in both of my units. I believe this has left uneven coverage across the GPU/CPU surfaces, likely reducing effective heat transfer and contributing to the overheating issues.
While I don’t recommend disassembling your units, if you do, I would recommend 1mm thermal pads and a thermal paste of your choice to replace the factory paste.
Cleaned the front with compressed air.
It now stands on its side with an external fan in front.
It survived a night of full llama-benchy tests on all models.
Thanks for all the detail and for sharing your stack write‑up — that’s super helpful context.
1. RAM / freezes
On Spark the 128 GB of LPDDR5X is a single pool that both the OS and GPU workloads share, so if the model + KV‑cache + framework overhead grow close to 128 GB the system can become unresponsive rather than cleanly throwing OOM.
A few things you can try:
Leave more headroom in your auto‑gmem logic. Even if CUDA reports ~90 GiB “free”, there’s additional host‑side overhead (framework, runtime, page tables, file cache, other containers, etc.). Instead of targeting ~0.64, try something more conservative like 0.45–0.50 and see if the freezes go away. Also consider lowering max context length and batch size on the largest models.
Enable swap on the NVMe (if you haven’t already). A 64–128 GB swapfile on the internal SSD won’t make OOM events fast, but it often turns “hard lockup” into “slow but survives”, which is much nicer for a basement box.
Tighten Linux memory accounting so you fail allocations instead of hanging:
Set vm.overcommit_memory=2 and something like vm.overcommit_ratio=90 in /etc/sysctl.d/ so userspace gets allocation failures before the kernel is out of options.
After changing, reboot and watch dmesg/journalctl -k during a run; you want to see clean OOMs, not GPU driver threads stuck for >120s.
Optional safety net: some folks run a tiny “memory guard” systemd service that periodically checks /proc/meminfo and, if MemAvailable drops below a floor (e.g. 4–8 GB), logs a warning and kills the top RAM‑using process. That’s not mandatory, but it’s a simple way to keep the box from becoming completely unresponsive when experimenting with aggressive settings.
If you have a log from a freeze (previous boot), grabbing the last ~200 kernel lines from journalctl -b -1 -k would help confirm whether it’s a GPU allocation path or generic host OOM.
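To make the headroom advice a bit more concrete, here is a rough sketch of how a memory fraction could be derived from /proc/meminfo instead of being hard-coded. The function name `suggest_gmem_fraction` and the idea of a fixed host reserve in GiB are assumptions for illustration only; how the resulting fraction maps onto your own auto-gmem logic or your framework's setting is something you would need to check.

```bash
#!/bin/sh
# Hypothetical helper: suggest a memory fraction that keeps a host-side reserve free.
# Usage: suggest_gmem_fraction <reserve_gib> [meminfo_path]
suggest_gmem_fraction() {
    local reserve_gib="$1"
    local meminfo="${2:-/proc/meminfo}"
    awk -v reserve_gib="$reserve_gib" '
        /^MemTotal:/     { total = $2 }   # values in /proc/meminfo are in KiB
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total <= 0) exit 1
            # Subtract the reserve (GiB converted to KiB) from what is available now
            usable = avail - reserve_gib * 1024 * 1024
            if (usable < 0) usable = 0
            printf "%.2f\n", usable / total
        }
    ' "$meminfo"
}

# Example (the 16 GiB reserve is an arbitrary illustration, not a recommendation):
#   suggest_gmem_fraction 16
```

The point of computing from MemAvailable rather than from the "free" figure CUDA reports is that it already accounts for file cache and other containers at the moment you launch.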
2. Thermals / 95 °C reboots
Per the DGX Spark docs, the platform is designed for continuous high‑load operation in an ambient of roughly 5–30 °C, with an integrated thermal management system. Hitting an ACPI sensor value of ~95 °C and then rebooting means you’re crossing a platform‑level thermal safety limit, so the shutdown is protective rather than a random crash.
A few practical checks:
Ambient and placement
Make sure the room itself stays well under 30 °C, especially once summer hits.
Keep the chassis on a hard, open surface with several cm of clearance on all sides, no soft material or walls blocking the vents, and no other hot gear immediately around it. A big fan in front helps, but if the hot air has nowhere to go it will still recycle.
Software / firmware level
Ensure you’re on the latest DGX OS / firmware bundle for Spark (latest public release has a number of stability and telemetry improvements). If you’re comfortable sharing, the output of uname -a and the Spark OS version from the dashboard would be useful.
Under‑load telemetry
While running a heavy Qwen3.5 prompt (before it reboots), capture:
sudo nvidia-smi -q -d TEMPERATURE,POWER
sensors
This helps see whether the GPU is thermally throttling as expected or the system is instead hitting a hard platform limit.
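Since a thermal reboot wipes whatever was on screen, it can help to log samples to a file while the benchmark runs. The following is a minimal sketch, not an official tool: the function names and log format are made up for illustration, it assumes the ACPI zones shown by `sensors` are also exposed under /sys/class/thermal, and the nvidia-smi query fields may report N/A on Spark.

```bash
#!/bin/sh
# Print the hottest thermal zone reading in whole degrees C.
# sysfs reports millidegrees in <root>/thermal_zone*/temp.
max_zone_temp_c() {
    local root="${1:-/sys/class/thermal}"
    local max=-1 f milli
    for f in "$root"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        milli=$(cat "$f" 2>/dev/null) || continue
        # Skip empty or non-numeric readings (negative values are ignored too)
        case "$milli" in ''|*[!0-9]*) continue ;; esac
        [ "$milli" -gt "$max" ] && max=$milli
    done
    [ "$max" -ge 0 ] || return 1
    echo $(( max / 1000 ))
}

# Append one timestamped sample to a log file and flush it to disk,
# so the last readings survive a sudden reboot.
log_thermal_sample() {
    local logfile="$1" root="${2:-/sys/class/thermal}"
    local temp gpu
    temp=$(max_zone_temp_c "$root") || temp="NA"
    gpu=$(nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm \
              --format=csv,noheader 2>/dev/null) || gpu="NA"
    printf '%s zone_max_c=%s gpu=%s\n' "$(date -Iseconds)" "$temp" "$gpu" >> "$logfile"
    sync "$logfile"
}

# Example loop to run in a second terminal during a benchmark:
#   while true; do log_thermal_sample "$HOME/spark-thermal.log"; sleep 5; done
```

After a reboot, the tail of the log shows how close the zones were to ~95 °C just before the box went down.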
As an advanced experiment, you can try slightly capping GPU clocks to reduce peak power draw during your longest runs, e.g.:
sudo nvidia-smi -rgc # reset any previous limits
sudo nvidia-smi -lgc 0,2300 # example: cap max graphics clock a bit below default
If that turns reboots into clean, long‑running benchmarks, it’s a strong hint that you’re right on the edge of a power/thermal guard rail, and that data is very helpful for the team that’s tuning Spark’s curves.
3. What would help debug further
If you’re up for it, the following from a failing run would really help:
Exact Spark OS / kernel version and whether any firmware updates have been applied.
journalctl -b -1 -k | tail -n 200 after a thermal reboot.
The memory and temperature telemetry mentioned above from a run that pushes the box close to failure.
With that we can line your repro up against our internal testing and advise whether you’re hitting a known limit vs. something we should treat as a bug on your particular unit.
Thanks for the hint.
For anyone who would like to know how to apply and undo the clock cap, here are the commands.
# Reset any previous limits
sudo nvidia-smi -rgc
# Cap max graphics clock (example: 2300 MHz)
sudo nvidia-smi -lgc 0,2300
I am more a fan of overclocking than “underclocking” (not sure if that is even a term),
so I will try to improve the airflow and cooling first before slowing down the machine.
Hello Neill, thanks for the detailed reply. I will go through it point by point.
Yes, that is a good idea. What really confused me is that when the DGX Dashboard shows 126.5GB used, the system suddenly gets super slow, and mouse and keyboard inputs are massively delayed.
It got to the point where I asked myself whether the RAM is defective or whether not all 128GB are usable.
So far I have reduced the RAM usage target from 128GB down to 126GB.
I am new to Linux, so no, I have not created a swap file so far. I asked the AI how to do it and this is what it came up with. I will see how it goes, since some here, like wentbackward, recommended not using a swap file at all.
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persistent across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
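For anyone worried about forgetting what was changed, here is a small sketch of how the fstab step could be made repeatable and reversible. The helper names are hypothetical, and it is safest to try them on a copy of /etc/fstab first; editing the real file needs a root shell.

```bash
#!/bin/sh
# Append the swap entry to an fstab-style file only if it is not already present.
ensure_swap_entry() {
    local fstab="$1" swapfile="${2:-/swapfile}"
    if awk -v p="$swapfile" '$1 == p && $3 == "swap" { found = 1 } END { exit !found }' "$fstab"; then
        return 0
    fi
    printf '%s none swap sw 0 0\n' "$swapfile" >> "$fstab"
}

# Remove the entry again (the undo step), keeping a .bak copy of the file.
remove_swap_entry() {
    local fstab="$1" swapfile="${2:-/swapfile}"
    cp "$fstab" "$fstab.bak" &&
        awk -v p="$swapfile" '!($1 == p && $3 == "swap")' "$fstab.bak" > "$fstab"
}

# Verify that swap is active:
#   swapon --show
#   free -h
# Full undo, from a root shell:
#   swapoff /swapfile && remove_swap_entry /etc/fstab && rm /swapfile
```

Running `ensure_swap_entry` twice leaves a single line in the file, which avoids the duplicate entries that `tee -a` produces if the command is pasted more than once.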
I might do that later if the other changes do not solve the issue.
The thing is, I am a big fan of Docker because I can mess around in the containers without affecting the OS.
When I “mess” with the OS itself, it can have side effects later and cause issues, and I am certain I will have forgotten what I set up here.
Just for completeness, the AI says these would be the commands:
echo "vm.overcommit_memory=2" | sudo tee /etc/sysctl.d/99-spark-oom.conf
echo "vm.overcommit_ratio=90" | sudo tee -a /etc/sysctl.d/99-spark-oom.conf
# Apply the changes without needing an immediate reboot:
sudo sysctl -p /etc/sysctl.d/99-spark-oom.conf
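With vm.overcommit_memory=2 the kernel refuses allocations once Committed_AS would exceed CommitLimit (swap plus overcommit_ratio percent of RAM), and both values are visible in /proc/meminfo. A small sketch for checking how much commit headroom is left, plus the undo steps as comments; the function name is just an illustration:

```bash
#!/bin/sh
# Print remaining commit headroom in KiB (CommitLimit minus Committed_AS).
# A value near zero or negative means new allocations would start failing in mode 2.
commit_headroom_kib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        /^CommitLimit:/  { limit = $2 }
        /^Committed_AS:/ { used = $2 }
        END { if (limit == "") exit 1; print limit - used }
    ' "$meminfo"
}

# Check the active settings:
#   sysctl vm.overcommit_memory vm.overcommit_ratio
# Undo (0 and 50 are the kernel defaults):
#   sudo rm /etc/sysctl.d/99-spark-oom.conf
#   sudo sysctl -w vm.overcommit_memory=0 vm.overcommit_ratio=50
```

It is probably worth looking at this number with your usual containers running before switching modes, so you know how close normal operation already is to the limit.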
Sounds interesting.
I might do that later.
Here is the code. AI generated / HAVE NOT TRIED IT!
4. Optional: Set Up a Memory Guard Service (Version 1)
If you want the “safety net” script Neill mentioned, here is a simple bash script and systemd service that checks for available memory and kills the largest consumer if it drops below 5 GB.
Create the script:
sudo nano /usr/local/bin/mem-guard.sh
Paste this inside:
#!/bin/bash
# Minimum available memory in kB. /proc/meminfo reports KiB,
# so 5000000 kB is roughly 4.8 GiB (close to the 5 GB floor).
MIN_MEM=5000000
while true; do
    # Read MemAvailable (in kB) from /proc/meminfo
    AVAILABLE_MEM=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    if [ -n "$AVAILABLE_MEM" ] && [ "$AVAILABLE_MEM" -lt "$MIN_MEM" ]; then
        logger -t mem-guard "WARNING: MemAvailable dropped to ${AVAILABLE_MEM} kB! Killing top RAM process."
        # Find the PID of the process using the most memory (line 1 is the header)
        TOP_PID=$(ps -eo pid,%mem --sort=-%mem | awk 'NR==2{print $1}')
        # Only kill if a PID was actually found
        [ -n "$TOP_PID" ] && kill -9 "$TOP_PID"
    fi
    sleep 5
done
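The systemd half of the memory guard is not shown above, so here is a minimal sketch of what such a unit could look like. The unit name mem-guard.service, the Restart settings and the helper function are assumptions, and like the script itself this is untested on a Spark.

```bash
#!/bin/sh
# Write a minimal systemd unit for the mem-guard script to the given path.
write_mem_guard_unit() {
    local dest="$1"
    cat > "$dest" <<'EOF'
[Unit]
Description=Kill the top RAM consumer when MemAvailable drops too low

[Service]
Type=simple
ExecStart=/usr/local/bin/mem-guard.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Install and start:
#   write_mem_guard_unit /tmp/mem-guard.service
#   sudo install -m 644 /tmp/mem-guard.service /etc/systemd/system/mem-guard.service
#   sudo chmod +x /usr/local/bin/mem-guard.sh
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now mem-guard.service
# Undo:
#   sudo systemctl disable --now mem-guard.service
#   sudo rm /etc/systemd/system/mem-guard.service
```

The warnings the script logs can then be read with `journalctl -t mem-guard`.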
The Spark is sitting on my Desk behind my Monitor.
The room temperature is usually around 21°C (69.8° F)
There is about 10cm (4 Inch) clearance around the Spark
What I noticed and mentioned before is that the design pattern on the front really acts like a filter and gets clogged with dust.
So far it’s running fine.
nvidia-smi -q -d TEMPERATURE
==============NVSMI LOG==============
Timestamp : Mon May 4 22:56:13 2026
Driver Version : 580.126.09
CUDA Version : 13.0
Attached GPUs : 1
GPU 0000000F:01:00.0
Temperature
GPU Current Temp : 52 C
GPU T.Limit Temp : 45 C
GPU Shutdown T.Limit Temp : N/A
GPU Slowdown T.Limit Temp : N/A
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
uname -a
Linux dgx-spark 6.17.0-1014-nvidia #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 17 19:01:40 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
dgxtop
=== System Information ===
Platform: Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39
Architecture: aarch64
Python Version: 3.12.3
CPU Cores: 20
Memory Total: 121.69 GB
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
I cannot see the Spark OS version in the dashboard, but I have clicked “Update” each time it appeared. Where else can I see the Spark OS version, and how can I update the NVIDIA driver (the version nvidia-smi reports) and CUDA to newer versions?
sudo nvidia-smi -q -d TEMPERATURE,POWER
==============NVSMI LOG==============
Timestamp : Mon May 4 23:13:13 2026
Driver Version : 580.126.09
CUDA Version : 13.0
Attached GPUs : 1
GPU 0000000F:01:00.0
Temperature
GPU Current Temp : 52 C
GPU T.Limit Temp : 44 C
GPU Shutdown T.Limit Temp : N/A
GPU Slowdown T.Limit Temp : N/A
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 16.14 W
Instantaneous Power Draw : 15.81 W
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
sensors
mt7925_phy0-pci-90100
Adapter: PCI adapter
temp1: N/A
acpitz-acpi-0
Adapter: ACPI interface
temp1: +58.4°C
temp2: +50.6°C
temp3: +51.6°C
temp4: +51.6°C
temp5: +58.4°C
temp6: +51.8°C
temp7: +53.4°C
nvme-pci-40100
Adapter: PCI adapter
Composite: +49.9°C (low = -273.1°C, high = +82.8°C)
(crit = +84.8°C)
Sensor 1: +53.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +49.9°C (low = -273.1°C, high = +65261.8°C)
Regarding “whether any firmware updates have been applied”:
fwupdmgr get-history
NVIDIA NVIDIA_DGX_Spark
│
├─Embedded Controller:
│ │ Device ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
│ │ Previous version: 0x02004e18
│ │ Update State: Success
│ │ Last modified: 2026-04-29 13:37
│ │ GUID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
│ │ Device Flags: • Internal device
│ │ • Updatable
│ │ • System requires external power source
│ │ • Supported on remote server
│ │ • Needs a reboot after installation
│ │ • Reported to remote server
│ │ • Device is usable for the duration of the update
│ │ • Signed Payload
│ │
│ └─DGX Spark EC FW Embedded Controller Update:
│ New version: 0x03000302
│ Remote ID: lvfs
│ Release ID: 139394
│ Summary: DGX Spark Embedded Controller Firmware Update
│ License: Proprietary
│ Size: 519.3 kB
│ Created: 2026-04-02
│ Urgency: High
│ Tested by NVIDIA:
│ Tested: 2026-04-10
│ Distribution: ubuntu 24.04
│ Old version: 0x02004b03
│ Version[fwupd]: 1.9.34
│ Vendor: NVIDIA
│ Duration: 30 seconds
│ Release Flags: • Trusted metadata
│ • Tested by trusted vendor
│ Description:
│ This update improves the performance and stability of the Embedded Controller in DGX Spark
│ Checksum: XXXXXXXXXXXXXXXXXXXXXX
│
├─UEFI Device Firmware:
│ │ Device ID: XXXXXXXXXXXXXXXXXXXXXXXXXXX
│ │ Previous version: 0x0200941a
│ │ Update State: Success
│ │ Last modified: 2026-04-29 13:37
│ │ GUID: XXXXXXXXXXXXXXXXXXXXXXXXXX
│ │ Device Flags: • Internal device
│ │ • Updatable
│ │ • System requires external power source
│ │ • Supported on remote server
│ │ • Needs a reboot after installation
│ │ • Reported to remote server
│ │ • Device is usable for the duration of the update
│ │ • Signed Payload
│ │
│ └─DGX Spark SoC FW System Update:
│ New version: 0x0200980f
│ Remote ID: lvfs
│ Release ID: 139396
│ Summary: DGX Spark SoC Firmware Update
│ License: Proprietary
│ Size: 30.4 MB
│ Created: 2026-04-02
│ Urgency: High
│ Tested by NVIDIA:
│ Tested: 2026-04-28
│ Distribution: ubuntu 24.04
│ Old version: 0x0200941b
│ Version[fwupd]: 1.9.33
│ Tested by NVIDIA:
│ Tested: 2026-04-15
│ Distribution: ubuntu 24.04
│ Old version: 0x0200941a
│ Version[fwupd]: 1.9.34
│ Tested by NVIDIA:
│ Tested: 2026-04-10
│ Distribution: ubuntu 24.04
│ Old version: 0x02009009
│ Version[fwupd]: 1.9.34
│ Vendor: NVIDIA
│ Duration: 30 seconds
│ Release Flags: • Trusted metadata
│ • Tested by trusted vendor
│ Description:
│ This update improves the performance and stability of the System-on-Chip Firmware including UEFI and GPU in DGX Spark
│ Checksum: XXXXXXXXXXXXXXXXXXXXXX
│
└─UEFI Device Firmware:
│ Device ID: XXXXXXXXXXXXXXXXXXXXXXXXXX
│ Previous version: 0x00000507
│ Update State: Success
│ Last modified: 2026-04-29 13:38
│ GUID: XXXXXXXXXXXXXXXXXXXXXXXXXX
│ Device Flags: • Internal device
│ • Updatable
│ • System requires external power source
│ • Needs a reboot after installation
│ • Reported to remote server
│ • Device is usable for the duration of the update
│ • Signed Payload
│
└─(null) Update:
New version: 0x00000516
Description:
The vendor did not supply any release notes.
The crash log is pretty long, so I gave it to an AI to analyse. Its summary follows.
2. Analysis of Your `crash_log.txt`
The log you provided captures the exact moment the system began to fail and initiated a shutdown sequence. Here is the critical sequence of events:
The Container Teardown: The vast majority of the log (from 21:08 to 21:10) shows Docker virtual ethernet interfaces (veth) and network bridges entering disabled states. This means your containers were actively crashing or being killed.
The GPU Driver Crash: At 21:10:51, there is a critical NVIDIA driver assertion failure:
NVRM: nvAssertOkFailedNoLog: Assertion failed: Requested object not found [NV_ERR_OBJECT_NOT_FOUND] (0x00000057) returned from pRmApi->Control( pRmApi, hClient, hDevice, NV0080_CTRL_CMD_INTERNAL_MEMSYS_SET_ZBC_REFERENCED, &params, sizeof(params)) @ mem_mgr_gm107.c:283
This indicates the GPU memory manager (mem_mgr) encountered a fatal error. An object it expected to find in memory was missing (NV_ERR_OBJECT_NOT_FOUND). This perfectly aligns with Neill’s theory about shared RAM/GPU memory exhaustion causing the driver to panic and fail rather than throwing a clean Out-Of-Memory (OOM) error.
The System Shutdown: Immediately after the GPU memory manager crashes, the file systems unmount and the hardware watchdog takes over at 21:11:03, forcing a system shutdown:
systemd-shutdown[1]: Using hardware watchdog 'SBSA Generic Watchdog'... Sending SIGTERM to remaining processes...
I just had the OOM clock limiter (~300MHz/~1000MHz) hit hard today. It was my fault: I was trying to compile while running inference, saw the temperature was high, and then the swap went to 100% before I caught it. The Bluetooth keyboard lagged on wake-up.
Anyway, I powered down, unplugged the Spark after a minute and booted up, but no dice. Still at ~300MHz/~1000MHz. Did it again. Still ~300MHz/~1000MHz. Unplugged again and went to watch a movie, came back, and it was still ~300MHz/~1000MHz.
I started thinking I had really cooked it this time, but figured I would spool up vllm, send a ‘Hello’ prompt and see what happened. The clock sputtered about for a bit and then jumped to 2434MHz/~3003MHz. I am getting too old for this.
Thought I would report my experience here in case someone else runs into the same situation. It seems you need to run something that hits the GPU hard to actually see whether the clock is stuck or not. It was; now it’s not.
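If you want to check this without eyeballing a monitor, one option is to sample the clocks while a prompt is running and look at the peak. This is only a sketch: the helper name is made up, and it assumes nvidia-smi reports `clocks.sm` and `clocks.mem` on Spark rather than N/A.

```bash
#!/bin/sh
# Read "sm_mhz, mem_mhz" CSV lines on stdin and print the highest SM clock seen.
# Lines whose first field is not a plain number (e.g. [N/A]) are ignored.
max_sm_clock() {
    awk -F', *' '$1 ~ /^[0-9]+$/ && $1 > max { max = $1 } END { print max + 0 }'
}

# Sample once per second for 60 s while an inference request is running:
#   timeout 60 nvidia-smi --query-gpu=clocks.sm,clocks.mem \
#       --format=csv,noheader,nounits -l 1 | max_sm_clock
```

If the peak stays around 300 under real load, the limiter is likely still engaged; if it climbs into the 2000s, the low idle reading was just idle.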
I was having similar issues and managed to narrow them down to power alone.
vLLM 0.20 was the culprit in my case. I use Qwen3.5 122B and thought 0.20 might perform better, but it was leaking memory under my workload: internal crashes accumulated until they used all the available memory after about a day. After that I still had to fight with the thermals. Running 4 concurrent long sessions pushed it to 95C and a power-down. With the clock cap at 2400 it survived more than 12 hours. I am now setting it to 2300 and hope that will solve the issue.