you guys are wimps! I got mine up to 100°C running ComfyUI.
It worked fine overnight, then started crashing over the last 4 hours, which is annoying. The load has not changed: it loads data, runs compute, and evicts data from the GPU. DF-based workflows run in 4 GB chunks; ML models are a full model load and execute.
Debugging this has been annoying as well: for whatever reason my metrics are partially dropped for at least the last 5 minutes before shutdown, and the journalctl logs show the same gap.
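For anyone trying to inspect that gap, the journal from the boot that ended in the shutdown can be pulled with journalctl's previous-boot selector. A minimal sketch (the function names and the 2000-line cap are mine, not from this thread):

```python
# Sketch: fetch the tail of the *previous* boot's journal, which is where
# any pre-shutdown messages would be. Note this only works if journald
# storage is persistent (Storage=persistent in /etc/systemd/journald.conf);
# a volatile journal is lost on reboot, which could also explain missing logs.
import subprocess

def previous_boot_tail_cmd(lines=2000):
    """argv for the last `lines` journal entries from the previous boot."""
    return ["journalctl", "-b", "-1", "--no-pager", "-n", str(lines)]

def fetch_previous_boot_tail(lines=2000):
    out = subprocess.run(previous_boot_tail_cmd(lines),
                         capture_output=True, text=True, check=True)
    return out.stdout
```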
The fact that temp5 (which ramps with the CPU) triggers the fans at 90°C while temp6 at 90°C (from lm_sensors) does not tells me that whatever is behind temp6 is either never properly cooled, or was once properly cooled and somehow isn't anymore. I can't figure out which without taking the device apart and voiding the warranty.
These workloads are majority GPU-based; the NVIDIA power statistics I have recorded show the GPU drawing up to 100 W.
temp5 reaches 90°C only when I run on the performance cores, and it revs up the fans (basically 100 W to the CPU performance cores, with the GPU then capped to 50 W).
temp6 reaches 90°C on my workloads and the fans don't react (100 W to the GPU).
The GPU itself hits the high 70s to low 80s and spins up the fans.
Is temp6 the copy engines region?
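For reference, the power and temperature numbers above can be logged with nvidia-smi's CSV query mode. A minimal sketch (timestamp, power.draw, and temperature.gpu are standard --query-gpu fields; the function names are mine):

```python
# Sketch: sample GPU power draw and temperature via nvidia-smi's CSV output.
# With --format=csv,noheader,nounits each line looks like
# "2025/01/01 12:00:00.000, 99.87, 78".
import csv
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]

def parse_sample(line):
    """Split one CSV line into (timestamp, watts, celsius)."""
    ts, power, temp = next(csv.reader([line], skipinitialspace=True))
    return ts, float(power), int(temp)

def sample_gpu():
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_sample(l) for l in out.stdout.splitlines() if l.strip()]
```

Run `sample_gpu()` in a loop (or use nvidia-smi's own `-l <seconds>` flag) to correlate power draw against the lm_sensors temps.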
@aniculescu or @margaretz could you help @eggman with this thermal issue?
I tried to replicate memory-heavy workloads, and what do you know: spamming model.to("cuda") and model.to("cpu") actually makes the fan spin. I haven't seen temp6 breach 95°C; the average is much closer to 90°C and it dips below 90°C at times. As soon as the fans rev down, it immediately crosses 95°C again.
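The reproducer described above boils down to something like this (a sketch; `cycle_model`, the iteration count, and the dwell parameter are mine, and it assumes a PyTorch model and an available CUDA device):

```python
# Sketch: bounce a model between host and device memory to stress the
# copy path, which is what seems to trip the fan curve in this thread.
import time

def cycle_model(model, n_cycles=500, dwell_s=0.0):
    """Force repeated full host->device and device->host transfers."""
    for _ in range(n_cycles):
        model = model.to("cuda")  # host -> device copy
        model = model.to("cpu")   # device -> host copy
        if dwell_s:
            time.sleep(dwell_s)   # optional pause between cycles
    return model
```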
Maybe this is a specific case of low disk and low CPU use with primarily GPU use (data is fed over 10G Ethernet), and the fan curve just isn't aggressive enough for it.
I will test the next day while forcing the fans on by loading the efficiency cores, since my old tests show that only adds about 10 W, compared to 100 W for the performance cores.
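Pinning the fan-forcing load onto the efficiency cores can be done with CPU affinity (Linux only). A sketch; the core IDs in the example are placeholders, since the actual performance/efficiency split on the GB10 has to be read from lscpu or sysfs:

```python
# Sketch: restrict this process (and its children) to a set of CPU IDs.
import os

def pin_to_cores(core_ids):
    """Set the CPU affinity of the current process; returns the new mask."""
    os.sched_setaffinity(0, set(core_ids))  # 0 = this process
    return os.sched_getaffinity(0)

# Example (placeholder IDs): suppose cores 0-3 are efficiency cores:
# pin_to_cores(range(4))
```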
If it is purely thermal shutdowns, it might be explained by the fact that all my shutdowns have happened during peak room-heat hours, while running overnight is just cool enough not to spike it over.
I've created a ticket now that there is a burning smell. I'm pretty much convinced this is a thermal issue.
I only use popular things like vLLM, Ollama, etc. Sometimes I compile wheel files from source. Even my NIC is hot.
Slow fan (2 RPM) suspected on DGX Spark / GB10. When I type sensors I see 2 rpm, and when I touch the front of the case the fan spins really slowly even under system load. Shouldn't the fan spin up fast in those scenarios? I run CUDA jobs using the LM Studio CLI (lms). I suspect the thermal design of this machine is not very good, even after installing the FW updates.
Hey, I have the exact same issue. As you know, the thermal settings / fan curve control is a mess. What worked for me was locking the GPU clock speed; you could test different clock speeds, but what I settled on was 2000 MHz. The 2 rpm reading is erroneous, don't pay attention to it. I also upgraded to BIOS 0103, and you can see the difference in fan behavior with that firmware. The BIOS upgrade was an improvement but still not enough; I had to limit the clock speed. Hope that helps!
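For anyone wanting to script that clock lock: nvidia-smi's -lgc (lock GPU clocks, taking a min,max MHz pair) and -rgc (reset) flags do this, and both need root. A sketch; the 2000 MHz value is just the setting the post above landed on:

```python
# Sketch: build the nvidia-smi commands to pin and unpin GPU core clocks.
import subprocess

def lock_gpu_clocks_cmd(mhz):
    """argv to pin GPU core clocks to a single frequency (needs root)."""
    return ["nvidia-smi", "-lgc", f"{mhz},{mhz}"]

def reset_gpu_clocks_cmd():
    """argv to return clock control to the driver."""
    return ["nvidia-smi", "-rgc"]

# e.g. subprocess.run(lock_gpu_clocks_cmd(2000), check=True)
```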
Spark is now shutting down after about 5 minutes of use. This is getting close to lemon-law territory… This is just a simple Wan 2.2 ComfyUI workflow as well; nothing special about it.
NVIDIA, can you weigh in here? Do we return these for replacement?
Please run the NVIDIA DGX Spark Field Diagnostics and share the resulting logs with me via DM. We'll make an RMA decision from there. Thank you.
Will do, thanks.
I'm on the Gigabyte GB10 that has the extra thermal cooling (got it specifically for this), and it gets warm but does quite well overall, even under extended load.
Any chance NVIDIA could add fan speed controls to the firmware? The fans on mine spin way too slowly, barely audible even when temps are touching 95°C. On my Strix Halo this was easily fixed by ramping up fan speeds, so I'm struggling to see why that can't be added in a firmware update for Spark users. There are clearly enough complaints about temps, and I'd rather not have my expensive hardware cook itself to an early failure.
Hmm, one of my Sparks started shutting down in the same high-GPU-load / no-CPU-load situation. Another Spark works fine. What's interesting is that it was the other way around before the latest power delivery firmware update.



