Today, NVIDIA announced that, with NVFP4 support, the DGX Spark delivers up to a 2.5x boost on the Qwen 235B model (two DGX Sparks paired).
Boost from what, running on a CPU? Early adopters have been waiting patiently for proper software support, which makes this marketing spin even harder to read – spin that probably cost more than a few developers dedicated to getting the stack optimized would have.
I wonder why they didn’t just post the performance numbers.
I wouldn’t be surprised if they just achieved the same performance we can already get in vLLM.
If anyone tries the newest TRTLLM before me, please post the benchmarks, preferably using llama-benchy.
Just ran two models on it. I can bench it, but I can tell right away it’s slower than vLLM. The “2.5x” must be compared to the full-weight model on the same stack?
uv run llama-benchy --base-url http://localhost:8355/v1 --model openai/gpt-oss-120b --depth 0 4096 8192 16384 32768 --latency-mode generation
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.1.1)
Date: 2026-01-09 14:47:15
Benchmarking model: openai/gpt-oss-120b at http://localhost:8355/v1
Loading text from cache: /home/joseph/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 140865
Warming up...
Warmup (User only) complete (no usage stats found).
Warmup (System+Empty) complete (no usage stats found).
Measuring latency using mode: generation...
Average latency (generation): 121.13 ms
Running test: pp=2048, tg=32, depth=0
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
Error: 400 - {"object":"error","message":"error downloading or loading vocab file: failed to download or load vocab file","type":"internal_error","param":null,"code":400}
And meanwhile TensorRT-LLM (following the playbook arguments verbatim, except with the context length increased to 64K) said this:
[01/09/2026-20:45:49] [TRT-LLM] [I] get signal from executor worker
INFO: Started server process [150]
INFO: Waiting for application startup.
INFO: Application startup complete.
[01/09/2026-20:47:17] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file
INFO: 127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file
INFO: 127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
[01/09/2026-20:47:18] [TRT-LLM] [E] Error in harmony chat completion: %s error downloading or loading vocab file: failed to download or load vocab file
INFO: 127.0.0.1:51222 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
It won’t serve tokens in Open WebUI either. It looks like some kind of template configuration problem, but it should have pulled everything it needed from either HF or the container.
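For anyone else hitting this: the 400 is coming straight from the server, so a bare request against the endpoint (same base URL and model as the llama-benchy command above) should reproduce it without llama-benchy or Open WebUI in the loop. Rough sketch of what I mean, nothing playbook-specific:
# minimal OpenAI-compatible chat request against the TRT-LLM server from the benchmark above
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'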
I was able to get Llama 3 8B running on 1.2.0rc6 earlier today. I’ll try to benchmark that next.
I used the script in the playbook, which looks like it pulls them; I’ll check again. Meanwhile, I benched llama3.1-8b-instruct-fp4, and here are the results:
So it wins on prompt processing against vLLM at some depths and loses at others. It’s definitely strange that speed goes up from 4096 to 8192; I wasn’t doing anything else on the system at the time.
t/g (token generation) is abysmal across the board relative to vLLM. I’m not sure what the advantage is, and I definitely don’t know what the 2.5x comparison is to.
I’ll look at the gpt-oss script again and see if I can bench it.
I’m unsure if you’re strictly talking about the recent announcement/livestream, but Nvidia hosts the model nvidia/Qwen3-14B-FP4, which is NVFP4. It’s much faster than vanilla Qwen3-14B.
Right, that’s the misleading part, I think. Being 2.5x faster than a full-weight model with a 4-bit quant is a ridiculously low bar, particularly given the Spark’s limitations. The real comparisons are against other quants and other inference providers, and it loses there.
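For an apples-to-apples number instead of NVIDIA’s, the llama-benchy invocation from earlier in the thread can be pointed at both stacks serving the same NVFP4 checkpoint. Rough sketch only: the ports are placeholders for wherever each server ends up listening, and whether the NVFP4 checkpoint actually loads under vLLM on the Spark is an assumption on my part.
# vLLM side (placeholder port)
uv run llama-benchy --base-url http://localhost:8000/v1 --model nvidia/Qwen3-14B-FP4 --depth 0 4096 8192 16384 --latency-mode generation
# TensorRT-LLM side, identical flags so the numbers are directly comparable
uv run llama-benchy --base-url http://localhost:8355/v1 --model nvidia/Qwen3-14B-FP4 --depth 0 4096 8192 16384 --latency-mode generation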
Total time for 2MP generation (20 steps) is consistently 260s. EDIT: that was with a reference image as conditioning input; plain text-to-image is a lot faster at 89s. By comparison, the same prompt and workflow using NVFP4 took 198s.
Which is fine for my purposes, but I’m somewhat regretting having sold the 5090 I bought at GTC to a friend in order to buy my DGX Spark instead, especially since it only cost $2100 tax included (albeit that was a reward for standing in a long line in the dark).
I just read an interesting article, “Jensen Steps In,” from the estimable Business Insider. It claims that the Nvidia CEO noticed the disquiet among Spark users after it was introduced and intervened to get attention placed on improving software support. According to the article, it was complaints from well-placed customers that did it! So if any of you are well-placed, please complain!
Thanks for testing! Yes, the generation speeds are disappointing, but those prompt processing speeds on the other hand… At least we know it can be faster!