I tested the unsloth UD-Q4_K_XL quant (single Spark). The tg128 from llama-bench (b7445) is 63.6 t/s (pp512: 2127.8 t/s). But my internal test in the llama.cpp web UI showed higher numbers (74 t/s). With most other models it was actually always the other way around.
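For reference, a llama-bench run producing those pp512/tg128 numbers looks roughly like this (the GGUF filename is a placeholder; point it at wherever your unsloth quant lives):

```bash
# pp512 / tg128 are llama-bench's defaults (-p 512 -n 128), spelled out here for clarity
./llama-bench -m Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf -p 512 -n 128
```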
Confirming that Nemotron 3 Nano 30B is supported on DGX Spark via llama.cpp. We are actively working on Playbooks and enabling additional support via SGLang and vLLM - stay tuned!
Many of you probably have this running on your DGX Sparks already, but I can confirm that the NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model can be executed with TensorRT.
I can test later today if you like; love to troubleshoot ;)
As per the announcement, a playbook for running Nemotron 3 Nano 30B with llama.cpp on Spark is now available!
thank you so much
Should --no-mmap be added to the server options?
+1 for that - otherwise model loading is much slower.
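E.g., a sketch of the server invocation with the flag added (model path and port are placeholders):

```bash
./llama-server -m Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
  --no-mmap -ngl 99 --host 0.0.0.0 --port 8080
```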
Pretty much forget about using TRT-LLM or vLLM unless you want to drive yourself to drink. It’s absolutely astonishing to me that NVIDIA can’t keep up with new models and an independent project beats the pants off of them. Unless you are running multiple Sparks, just use llama.cpp. BTW, this is how NVIDIA loses its way: the llama.cpp folks are building a true cross-platform inference engine that actually works.
UPDATE: Just to save folks time, the playbook for running nemotron-nano on TRT-LLM that NVIDIA published 4 days ago (here) does not work, even when pulling the new 1.2.0rc5 Docker image.
vLLM is working great on the Spark. I’ve been doing some tests with TRT-LLM and I second that: it’s incredible that not only is it the worst performer in a cluster, but there’s a total lack of basic support, and even in the playbooks it has a very shy presence. I thought I would see TRT-LLM and Dynamo in pretty much every single playbook, with contributions from SGLang, vLLM, llama.cpp, LM Studio, Ollama, etc. on their own websites and maybe a third-party playbook section. It was very underwhelming to see how little those teams played with the Spark and how much the community had to struggle with very basic things. Things are getting better, but they’re definitely not accelerating.
I’d say the llama.cpp team was the first to properly support Spark - kudos to them.
NVIDIA is quite active in vLLM dev, so while things were quite rough in the beginning, they improved fairly quickly (but proper FP4 support is still lacking).
SGLang… Well, SGLang is kinda weird. They were part of the marketing campaign, published a few blogs, made tweaks in Triton and SGLang to optimize gpt-oss on Spark… and then abandoned their effort. The changes they’ve made to Triton and SGLang are sitting in one of the forks that is severely behind the main branch, and it seems like no one is going to incorporate them into the respective repos. The developer who’s done the work is not responding to questions.
I wish NVIDIA would direct half of its marketing budget toward making sure Spark is properly supported in popular platforms, but that’s big companies for you…
I don’t know why they don’t add GB10 to their release targets. I’m beginning to suspect there is some majorly crippled aspect of the chip, such that they have decided to quickly orphan it?
They make big bucks selling datacenter products. I mean, GB300 is not released yet, but its support has just been added to vLLM.
Consumer Blackwell (and GB10 is a consumer, mostly prosumer-grade, chip) was released back in January (RTX 5090), and support is still lacking…