I tested the unsloth UD-Q4_K_XL quant (single Spark). The tg128 from llama-bench (b7445) is 63.6 t/s (pp512: 2127.8 t/s). But my internal test in the llama.cpp web UI showed higher numbers (74 t/s). With most other models it was actually always the other way around.
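For reference, a llama-bench run producing those pp512/tg128 numbers looks roughly like this (the GGUF filename is a placeholder; point it at wherever your unsloth quant lives):

```bash
# pp512 / tg128 are llama-bench's defaults (-p 512 -n 128), spelled out here for clarity
./llama-bench -m Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf -p 512 -n 128
```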
Confirming that Nemotron 3 Nano 30B is supported on DGX Spark via llama.cpp. We are actively working on Playbooks and enabling additional support via SGLang and vLLM - stay tuned!
Many of you probably have this running on your DGX Sparks already, but I can confirm that the NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model can be executed with TensorRT.
I can test later today if you like; love to troubleshoot ;)
As per the announcement, a playbook for running Nemotron 3 Nano 30B with llama.cpp on Spark is now available!
thank you so much
Should --no-mmap be added to the server options?
+1 for that - otherwise model loading is much slower.
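E.g., a sketch of the server invocation with the flag added (model path and port are placeholders):

```bash
./llama-server -m Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
  --no-mmap -ngl 99 --host 0.0.0.0 --port 8080
```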
Pretty much forget about using TRT-LLM or vLLM unless you want to drive yourself to drink. It’s absolutely astonishing to me that NVIDIA can’t keep up with new models and an independent project beats the pants off of them. Unless you are running multiple Sparks, just use llama.cpp. BTW, this is how NVIDIA loses its way: the llama.cpp folks are building a true cross-platform inference engine that actually works.
UPDATE: Just to save folks time, the playbook for running nemotron-nano on TRT-LLM that NVIDIA published 4 days ago (here) does not work, even when pulling the new 1.2.0rc5 Docker image.
vLLM is working great on the Spark. I’ve been doing some tests with TRT-LLM and I second that: it’s incredible that not only is it the worst performer in a cluster, but there’s a total lack of basic support, and even in the playbooks it has a very shy presence. I thought I would see TRT-LLM and Dynamo in pretty much every single playbook, with contributions from SGLang, vLLM, llama.cpp, LM Studio, Ollama, etc. on their own websites and maybe a third-party playbook section. It was very underwhelming to see how little those teams played with the Spark and how much the community had to struggle with very basic things. Things are getting better, but they’re definitely not accelerating.
I’d say the llama.cpp team was the first to properly support Spark - kudos to them.
NVIDIA is quite active in vLLM dev, so while things were quite rough in the beginning, they improved fairly quickly (but proper FP4 support is still lacking).
SGLang… Well, SGLang is kinda weird. They were part of the marketing campaign, published a few blogs, made tweaks in Triton and SGLang to optimize gpt-oss on Spark… and then abandoned their effort. The changes they’ve made to Triton and SGLang are sitting in one of the forks that is severely behind the main branch, and it seems like no one is going to incorporate them into the respective repos. The developer who’s done the work is not responding to questions.
I wish NVIDIA would direct half of its marketing budget toward making sure Spark is properly supported in popular platforms, but that’s big companies for you…
I don’t know why they don’t add GB10 to their release targets. I’m beginning to suspect there is some majorly crippled aspect of the chip, such that they have decided to quickly orphan it?
They make big bucks selling datacenter products. I mean, GB300 is not released yet, but its support has just been added to vLLM.
Consumer Blackwell (and GB10 is a consumer, mostly prosumer-grade, chip) was released back in January (RTX 5090), and support is still lacking…