We have so many posts about vLLM now that I decided to start a new one specifically about FP4 quants. As of today, FP4 is not properly utilized in current vLLM builds on our hardware, so you lose a lot of performance by picking NVFP4 quants over AWQ 4-bit ones. Here is a comparison between Qwen3-VL-235-A2…

Exciting! Going to try this tonight. Nvidia’s official guide for Nemotron-3-Super recommends an aggressive setting of 5 predicted tokens.

I’ve had better results with 1; the draft acceptance rate drops significantly after that.
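To make the acceptance-rate point concrete, here is a rough sketch of how the expected number of tokens per verification step grows with the number of draft tokens. It uses the standard speculative-decoding accounting (a draft token only counts if every draft before it was accepted, and each step always yields one token from the main model). The function name and the per-position rates are hypothetical illustration values, not measurements from Nemotron-3-Super or any other model.

```python
def expected_tokens_per_step(acceptance_rates):
    """Expected tokens emitted per verification step.

    acceptance_rates[i] is the (assumed) probability that draft token i
    is accepted, given that all earlier draft tokens were accepted.
    The main model always contributes one token per step.
    """
    expected = 1.0   # the token the main model produces itself
    survive = 1.0    # probability that all drafts so far were accepted
    for rate in acceptance_rates:
        survive *= rate
        expected += survive
    return expected


if __name__ == "__main__":
    # Hypothetical rates that fall off quickly with draft position.
    rates = [0.8, 0.5, 0.3, 0.15, 0.05]
    for k in range(1, len(rates) + 1):
        print(k, round(expected_tokens_per_step(rates[:k]), 4))
```

When the rates fall off like this, the gain from each extra draft token shrinks quickly while the drafting cost keeps growing. This sketch ignores that cost entirely, so it only bounds the benefit from above.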

eugr wrote: "Actually, it does now, starting from yesterday's build :)"

I was hoping to report back with full success, but… I’m getting crashes now even after disabling MTP. Though it did at least start with MTP enabled, which I take as forward progress! I junked a bunch of cached stuff just n…

It should work with the build from the wheels (so without --rebuild-* flags), and try the recipe I posted above first. It WILL crash if you use --enable-prefix-caching without --mamba-cache-mode align.
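The flag constraint above can be checked before launching. A minimal sketch, assuming the server flags are available as a plain argument list: `check_prefix_caching_flags` is a hypothetical helper, not part of vLLM or the recipe scripts, and it only encodes the rule stated in this thread.

```python
def check_prefix_caching_flags(args):
    """Return a list of problems found in a list of command-line flags.

    Encodes one rule from this thread: --enable-prefix-caching should be
    accompanied by --mamba-cache-mode align.
    """
    mode = None
    for i, arg in enumerate(args):
        if arg == "--mamba-cache-mode" and i + 1 < len(args):
            mode = args[i + 1]               # "--flag value" spelling
        elif arg.startswith("--mamba-cache-mode="):
            mode = arg.split("=", 1)[1]      # "--flag=value" spelling

    problems = []
    if "--enable-prefix-caching" in args and mode != "align":
        problems.append(
            "--enable-prefix-caching is set without --mamba-cache-mode align"
        )
    return problems


if __name__ == "__main__":
    risky = ["--enable-prefix-caching", "--gpu-memory-utilization", "0.87"]
    safe = risky + ["--mamba-cache-mode", "align"]
    print(check_prefix_caching_flags(risky))
    print(check_prefix_caching_flags(safe))
```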

Very stable for me using --mamba-cache-mode align in my testing over the last few days. I think Nemotron Super edges out qwen3.5-122b at the moment. I’ve had decent luck performance-wise with MTP k=3, but I defer to spark-goat @eugr

Yes, looks like MTP n=3 is the upper bound of what’s feasible. At n=5 it starts outputting gibberish after a while. I’m still having issues benchmarking MTP as it depends a lot on the prompt and what the model was trained on, so even my experimental branch of llama-benchy that tries to simulate a r…

Is it only me, or on the latest vLLM 0.19 with NVFP4 GEMMs is marlin at least 20% more performant than flashinfer_cutlass (the default choice at the moment)?
vllm version: 0.19.1rc1.dev15+g50cd5674b.d20260403
flashinfer: Release Prebuilt FlashInfer Wheels (0.6.7-e0f3729b-d20260403) - DGX Spark Only · eugr/spark-vllm…

The reinstall worked and this has been running fine since last night:

$ ./run-recipe.py nemotron-3-super-nvfp4-flashinfer --max-model-len 1048576 --gpu-memory-utilization 0.87 --solo --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

I tried with MTP=5 for a bit and watched the …
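If you want to sweep the number of speculative tokens, it may help to generate the quoted JSON argument rather than hand-edit it. A small sketch using only the Python standard library: `speculative_config_arg` is a hypothetical helper, and the two keys are the ones shown in the command above.

```python
import json
import shlex


def speculative_config_arg(num_speculative_tokens, method="mtp"):
    """Build a shell-safe --speculative-config argument string."""
    config = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return "--speculative-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(speculative_config_arg(n))
```

For n=3 this reproduces the argument exactly as typed in the command above, which makes it easy to compare runs that differ only in the draft length.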

It won’t fit 1M tokens on a single spark even without MTP. With MTP, it basically loads a drafter model in BF16 in addition to the main model, so there is even less space for KV cache.

Any idea what the limit is on a single spark with MTP? It was running fine with 1M before MTP was enabled, though I don’t think the big refactor work got beyond ~480k at any point. Before MTP (and using --gpu-memory-utilization 0.90), vllm start-up messages indicated it had enough room for 1.4 mill…

PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

deanc April 5, 2026, 3:27am 218

Try this - Nemotron-3-Super-120B at 20-22 tok/s Super Special Recipe
