it definitely isn’t scaling linearly with devices, but still… +10% wow!
@eugr Yes. I’ll fit it in when I can in the next few days.
Good day!
Have you had a chance to verify this?
Also, will it work just as fast in the new version as it does in the containers from our forum members @eugr and @christopher_owen ?
MXFP4 optimizations by Christopher only exist in his original container and inside my build (mxfp4 flavor). These changes were not merged into upstream, and no other similar optimizations have been done by anyone else (at least, nothing similar has been merged), so the answer is no, it is still slow in a standard vLLM build.
I just got your MXFP4 docker image up and running. Fantastic work! I haven’t run detailed performance comparisons yet, but I can tell this represents a dramatic speedup over my prior vLLM docker containers running gpt-oss-120b. Thanks for sharing your hard work here!
thank you for the kind words! I’m still working away at it in the background hoping to improve the numbers a little more :).
I’m curious, based on what you’ve seen so far, why do you think NVFP4 underperforms on Spark?
I haven’t done a deep NVFP4 dive on Spark yet (so I don’t want to over-claim), but based on what I’ve seen so far in vLLM/FlashInfer with gpt-oss-120b, my working hypotheses for “NVFP4 underperforms” would be along these lines:
- You only win if the path stays NVFP4 end-to-end. If any part of the stack is unpacking / dequantizing to FP16/FP8 (or doing extra reformatting) before the GEMM/attention kernels, the conversion overhead can erase the bandwidth savings pretty quickly, especially at small or awkward shapes.
- Tensor cores vs CUDA cores is often the whole ballgame. The goal is to land on tensor-core MMA paths; they’re dramatically faster for these workloads. If NVFP4 leads to a code path that falls back to CUDA-core math (or mixes in a lot of non-TC work like packing, scaling, or format conversions), you can easily lose even if the weights are “smaller.”
- Kernel maturity + shape coverage matters more than the datatype on paper. A “theoretically faster” format can lose to a better-tuned FP8/INT8 kernel if NVFP4 is hitting less optimal tile shapes, more padding, or fallback kernels. With MoE in particular, you often end up in lots of non-ideal group sizes / strides.
- On SM120 specifically, small/odd shapes and low-bit formats can get forced into what the hardware supports natively. If the exact small-N / small-M cases aren’t directly supported by the underlying SASS pathways, you can end up routing through supported datatypes/instructions (or doing extra packing/reinterpretation) just to make the kernel legal. That translation step can dominate when the compute is already tiny.
- Spark can be overhead/bandwidth dominated in places where precision alone doesn’t help. In my own work the big unlocks weren’t “smaller bits = faster” so much as finding what is actually limiting the pipeline (memory traffic, launches, scheduling bubbles). If NVFP4 adds extra passes (packing/scales) without reducing the dominant traffic, it can net out negative.
Right now I’m heads-down on getting gpt-oss-120b’s kernels into a better place (recently got FC1 + SwiGLU fused, largely to attack launch overhead and reduce stalls/bubbling). Once that’s stable, I’m happy to circle back and properly profile NVFP4 on Spark (Nsight + looking explicitly for conversion/reformat kernels, whether we’re truly on tensor-core MMA for the hot kernels, achieved occupancy, and memory throughput).
If you have a specific model + backend combination where you’re seeing NVFP4 underperform (or one you think is “the right” NVFP4 showcase), tell me which one you’d recommend and I can use that as the next target.
I hope to be announcing some news about gpt-oss-120b in the near future.
Any NVFP4 model that actually runs and doesn’t crash (like Nemotron), but for instance this one:
- RedHatAI/Qwen3-30B-A3B-NVFP4 gives 65 t/s on a single Spark
- QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ gives 83 t/s on a single Spark
I would dearly love more work on NVFP4 models in vLLM!
@eugr has good suggestions, but for an initial push (and maximum potential engagement from Nvidia) my suggestion would be to look at Nvidia’s own quants in their collection “Inference Optimized Checkpoints (with Model Optimizer)”. I’d suggest consolidating on standout examples across 3 major categories which, if handled, would likely generalize well.
For example, two models from each major category:
- A MoE model like
  - nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
  - nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
- A dense model around 8B such as
  - nvidia/Llama-3.1-8B-Instruct-NVFP4
  - nvidia/Qwen3-8B-NVFP4 (could do 14B or 32B instead)
- A VL or multimodal model
  - nvidia/Phi-4-multimodal-instruct-NVFP4
  - nvidia/Qwen2.5-VL-7B-Instruct-NVFP4
I believe pretty much all of these have competing AWQ quants available for comparison.
I’d avoid Qwen3-Next for now - it’s one of the few models where AWQ quants perform even worse than FP8.
Sorry for delay – more benchmarking but this time using @eugr’s repo (GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks) at commit: ace16f3a8f307b72e0916ad156836f8853301c6c:
Build: ./build-and-copy.sh -t vllm-node-mxfp4 --exp-mxfp4 -c
Running on 4 Sparks with a configuration matching the repo README, except with 4 Sparks and tp=4. Since a lot of time has passed (1 week is a lifetime in this stuff…), there could be a number of confounding variables in the differences between the benchmarks, but hopefully it’s still interesting information.
Same benchmarking as my previous post: llama-benchy --base-url "http://10.24.11.13:8000/v1" --model "openai/gpt-oss-120b" --tokenizer "openai/gpt-oss-120b" --pp 512 2048 8192 --tg 32 128 --runs 5
TP=4, Attempt #1
| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------|-------:|------------------:|--------------:|-----------------:|-----------------:|-----------------:|-----------------:|
| openai/gpt-oss-120b | pp512 | 3593.92 ± 106.31 | | | 143.85 ± 4.29 | 142.59 ± 4.29 | 182.91 ± 3.85 |
| openai/gpt-oss-120b | tg32 | 82.01 ± 0.51 | 84.96 ± 0.53 | 84.96 ± 0.53 | | | |
| openai/gpt-oss-120b | pp512 | 3655.55 ± 19.13 | | | 141.32 ± 0.73 | 140.06 ± 0.73 | 180.59 ± 1.57 |
| openai/gpt-oss-120b | tg128 | 83.10 ± 0.19 | 84.00 ± 0.00 | 84.00 ± 0.00 | | | |
| openai/gpt-oss-120b | pp2048 | 4515.44 ± 804.20 | | | 466.81 ± 67.68 | 465.56 ± 67.68 | 521.86 ± 77.28 |
| openai/gpt-oss-120b | tg32 | 59.15 ± 13.56 | 61.28 ± 14.05 | 61.28 ± 14.05 | | | |
| openai/gpt-oss-120b | pp2048 | 6303.56 ± 18.19 | | | 326.16 ± 0.94 | 324.90 ± 0.94 | 365.63 ± 1.16 |
| openai/gpt-oss-120b | tg128 | 82.33 ± 0.08 | 83.00 ± 0.00 | 83.00 ± 0.00 | | | |
| openai/gpt-oss-120b | pp8192 | 7992.99 ± 2700.50 | | | 1166.21 ± 419.94 | 1164.95 ± 419.94 | 1219.83 ± 435.62 |
| openai/gpt-oss-120b | tg32 | 67.41 ± 16.10 | 69.83 ± 16.68 | 69.83 ± 16.68 | | | |
| openai/gpt-oss-120b | pp8192 | 10980.30 ± 47.53 | | | 747.33 ± 3.22 | 746.08 ± 3.22 | 787.53 ± 3.10 |
| openai/gpt-oss-120b | tg128 | 81.69 ± 0.21 | 82.60 ± 0.49 | 82.60 ± 0.49 | | | |
llama-benchy (0.3.0)
date: 2026-02-10 13:50:14 | latency mode: api
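A quick way to sanity-check a table like this: in the low-variance rows, est_ppt looks like it is simply prompt tokens divided by prefill throughput. That relationship is inferred from the numbers, not taken from llama-benchy's documentation, so treat the sketch below as an assumption. In the noisy rows the two disagree, which is what you would expect if the tool averages per-run values rather than dividing the averages.

```python
# (prompt tokens, reported prefill t/s, reported est_ppt in ms) from Attempt #1.
stable_rows = [
    (512, 3655.55, 140.06),
    (2048, 6303.56, 324.90),
    (8192, 10980.30, 746.08),
]
# A high-variance row from the same table (4515.44 ± 804.20 t/s).
noisy_row = (2048, 4515.44, 465.56)


def est_ppt_ms(prompt_tokens: int, tokens_per_s: float) -> float:
    """Prompt-processing time implied by a throughput figure, in milliseconds."""
    return prompt_tokens / tokens_per_s * 1000.0


for tokens, tps, reported in stable_rows + [noisy_row]:
    implied = est_ppt_ms(tokens, tps)
    print(f"pp{tokens}: implied {implied:8.2f} ms, reported {reported:8.2f} ms, "
          f"gap {abs(implied - reported):6.2f} ms")
```

A large gap between implied and reported values is a hint that one or more of the five runs was an outlier.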
TP=4, Attempt #2
| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------|-------:|------------------:|--------------:|-----------------:|---------------:|---------------:|----------------:|
| openai/gpt-oss-120b | pp512 | 3577.69 ± 31.85 | | | 144.23 ± 1.28 | 143.12 ± 1.28 | 183.14 ± 1.86 |
| openai/gpt-oss-120b | tg32 | 81.46 ± 0.26 | 84.40 ± 0.26 | 84.40 ± 0.26 | | | |
| openai/gpt-oss-120b | pp512 | 3583.50 ± 34.15 | | | 144.00 ± 1.36 | 142.89 ± 1.36 | 183.27 ± 1.36 |
| openai/gpt-oss-120b | tg128 | 82.64 ± 0.26 | 83.40 ± 0.49 | 83.40 ± 0.49 | | | |
| openai/gpt-oss-120b | pp2048 | 6287.21 ± 13.38 | | | 326.85 ± 0.69 | 325.74 ± 0.69 | 366.62 ± 0.82 |
| openai/gpt-oss-120b | tg32 | 81.84 ± 0.83 | 84.81 ± 0.85 | 84.81 ± 0.85 | | | |
| openai/gpt-oss-120b | pp2048 | 5406.21 ± 1059.75 | | | 396.48 ± 84.44 | 395.37 ± 84.44 | 445.44 ± 96.06 |
| openai/gpt-oss-120b | tg128 | 70.92 ± 13.58 | 75.20 ± 12.50 | 75.20 ± 12.50 | | | |
| openai/gpt-oss-120b | pp8192 | 10931.13 ± 26.76 | | | 750.54 ± 1.83 | 749.42 ± 1.83 | 790.99 ± 1.59 |
| openai/gpt-oss-120b | tg32 | 80.36 ± 0.16 | 83.27 ± 0.16 | 83.27 ± 0.16 | | | |
| openai/gpt-oss-120b | pp8192 | 10973.44 ± 53.43 | | | 747.66 ± 3.64 | 746.55 ± 3.64 | 787.65 ± 3.86 |
| openai/gpt-oss-120b | tg128 | 81.22 ± 0.48 | 82.00 ± 0.63 | 82.00 ± 0.63 | | | |
llama-benchy (0.3.0)
date: 2026-02-10 13:51:31 | latency mode: api
TP=4, Attempt #3
| model | test | t/s | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------------------|-------:|------------------:|--------------:|-----------------:|---------------:|---------------:|----------------:|
| openai/gpt-oss-120b | pp512 | 3694.73 ± 40.52 | | | 142.03 ± 1.54 | 138.59 ± 1.54 | 180.71 ± 2.20 |
| openai/gpt-oss-120b | tg32 | 81.97 ± 0.45 | 84.92 ± 0.46 | 84.92 ± 0.46 | | | |
| openai/gpt-oss-120b | pp512 | 3481.35 ± 383.17 | | | 152.67 ± 19.64 | 149.23 ± 19.64 | 195.67 ± 29.06 |
| openai/gpt-oss-120b | tg128 | 70.61 ± 15.14 | 71.80 ± 15.10 | 71.80 ± 15.10 | | | |
| openai/gpt-oss-120b | pp2048 | 5059.94 ± 983.86 | | | 422.90 ± 75.70 | 419.46 ± 75.70 | 471.10 ± 84.06 |
| openai/gpt-oss-120b | tg32 | 69.21 ± 15.24 | 71.71 ± 15.79 | 71.71 ± 15.79 | | | |
| openai/gpt-oss-120b | pp2048 | 6294.43 ± 16.52 | | | 328.81 ± 0.86 | 325.37 ± 0.86 | 368.07 ± 1.42 |
| openai/gpt-oss-120b | tg128 | 82.43 ± 0.26 | 83.20 ± 0.40 | 83.20 ± 0.40 | | | |
| openai/gpt-oss-120b | pp8192 | 10977.91 ± 14.83 | | | 749.67 ± 1.01 | 746.23 ± 1.01 | 789.70 ± 1.45 |
| openai/gpt-oss-120b | tg32 | 79.90 ± 0.44 | 82.81 ± 0.44 | 82.81 ± 0.44 | | | |
| openai/gpt-oss-120b | pp8192 | 10636.71 ± 713.10 | | | 777.47 ± 57.69 | 774.03 ± 57.69 | 821.19 ± 66.40 |
| openai/gpt-oss-120b | tg128 | 76.43 ± 9.57 | 79.00 ± 6.51 | 79.00 ± 6.51 | | | |
llama-benchy (0.3.0)
date: 2026-02-10 13:52:48 | latency mode: api
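Several rows across the three attempts have a very large ± value. A minimal sketch for flagging them, using the t/s cells from the pp rows of Attempt #3: it parses each "mean ± std" cell and computes std/mean. The 5% cutoff is an arbitrary illustrative choice, not something llama-benchy defines.

```python
import re

# (test, "mean ± std") cells copied in table order from the pp rows of Attempt #3.
cells = [
    ("pp512", "3694.73 ± 40.52"),
    ("pp512", "3481.35 ± 383.17"),
    ("pp2048", "5059.94 ± 983.86"),
    ("pp2048", "6294.43 ± 16.52"),
    ("pp8192", "10977.91 ± 14.83"),
    ("pp8192", "10636.71 ± 713.10"),
]


def parse_cell(cell: str) -> tuple[float, float]:
    """Split a 'mean ± std' table cell into two floats."""
    match = re.fullmatch(r"\s*([\d.]+)\s*±\s*([\d.]+)\s*", cell)
    if match is None:
        raise ValueError(f"not a 'mean ± std' cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def is_noisy(cell: str, threshold: float = 0.05) -> bool:
    """True if std/mean exceeds the (illustrative) threshold."""
    mean, std = parse_cell(cell)
    return std / mean > threshold


for test, cell in cells:
    mean, std = parse_cell(cell)
    flag = "NOISY" if is_noisy(cell) else "ok"
    print(f"{test:7s} {cell:20s} cv={std / mean:6.1%}  {flag}")
```

Rows flagged this way are probably better re-run than compared directly against other builds.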
good lord this is cool.
vLLM 0.16 has been tagged
(Including the patch for CVE-2026-0994)
I’m running some batch decisioning workloads, and here is the throughput from a single Spark. I’m seeing >200 tok/s with high KV cache usage. This is using the latest spark-vllm-docker and the build/recipe for gpt-oss-120b. So far I’ve done 8+ hour runs under sustained processing without any issues.
(APIServer pid=176) INFO 02-12 21:21:51 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 35 reqs, GPU KV cache usage: 99.5%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:01 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 36 reqs, GPU KV cache usage: 99.7%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:11 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 214.5 tokens/s, Running: 55 reqs, Waiting: 36 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:21 [loggers.py:257] Engine 000: Avg prompt throughput: 2059.6 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 55 reqs, Waiting: 36 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:31 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 37 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:41 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 37 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:22:51 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 130.1 tokens/s, Running: 55 reqs, Waiting: 37 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:23:01 [loggers.py:257] Engine 000: Avg prompt throughput: 1779.0 tokens/s, Avg generation throughput: 181.3 tokens/s, Running: 55 reqs, Waiting: 38 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:23:11 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 38 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:23:21 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 214.5 tokens/s, Running: 55 reqs, Waiting: 38 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:23:31 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 214.5 tokens/s, Running: 55 reqs, Waiting: 39 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.1%
(APIServer pid=176) INFO 02-12 21:23:41 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.6 tokens/s, Running: 54 reqs, Waiting: 40 reqs, GPU KV cache usage: 98.5%, Prefix cache hit rate: 0.1%
Is this using gpt-oss-20b? 200 TPS is quite a bit higher than I would expect for gpt-oss-120b.
This is gpt-oss-120b on a single spark, but I’m running batch jobs multithreaded (note 54 requests running, 40 requests waiting)
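To make the aggregate-versus-per-request distinction concrete, here is a minimal sketch that pulls the generation throughput and running-request count out of log lines like the ones above and divides one by the other. The regex assumes the exact field wording shown in that excerpt, and the per-request figure is only a rough average, since requests in a batch are at different stages.

```python
import re

# Three lines abridged from the log excerpt above (timestamps and unused fields dropped).
log_lines = [
    "Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 220.0 tokens/s, Running: 55 reqs, Waiting: 35 reqs",
    "Engine 000: Avg prompt throughput: 2059.6 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 55 reqs, Waiting: 36 reqs",
    "Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.6 tokens/s, Running: 54 reqs, Waiting: 40 reqs",
]

PATTERN = re.compile(
    r"Avg generation throughput: ([\d.]+) tokens/s, Running: (\d+) reqs"
)


def per_request_rates(lines):
    """Yield (aggregate t/s, running requests, t/s per request) for each matching line."""
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not an engine stats line
        total = float(match.group(1))
        running = int(match.group(2))
        if running:
            yield total, running, total / running


for total, running, each in per_request_rates(log_lines):
    print(f"{total:6.1f} t/s aggregate / {running} requests = {each:.2f} t/s per request")
```

So 220 t/s across 55 running requests works out to about 4 t/s per request; the headline number is the sum over the batch, not the speed any single client sees.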
./build-and-copy.sh -t vllm-node-mxfp4 --exp-mxfp4
docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host \
  --ipc=host \
  -v /home/edison/Downloads/vllm/models/:/models \
  vllm-node-mxfp4 \
  bash -c "vllm serve /models/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend CUTLASS \
    --mxfp4-layers moe,qkv,o,lm_head \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens 8192"
llama-benchy --base-url "http://0.0.0.0:8000/v1" --model "/models/gpt-oss-120b" --pp 512 2048 8192 --tg 32 128 --runs 5
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /models/gpt-oss-120b | pp512 | 2208.78 ± 20.00 | | 233.24 ± 2.10 | 231.82 ± 2.10 | 282.19 ± 2.52 |
| /models/gpt-oss-120b | tg32 | 62.44 ± 0.16 | 64.68 ± 0.17 | | | |
| /models/gpt-oss-120b | pp512 | 2206.37 ± 28.06 | | 233.51 ± 2.94 | 232.09 ± 2.94 | 282.31 ± 3.63 |
| /models/gpt-oss-120b | tg128 | 62.57 ± 0.03 | 63.00 ± 0.00 | | | |
| /models/gpt-oss-120b | pp2048 | 4912.91 ± 25.66 | | 418.29 ± 2.18 | 416.87 ± 2.18 | 467.58 ± 2.46 |
| /models/gpt-oss-120b | tg32 | 61.62 ± 0.15 | 63.83 ± 0.16 | | | |
| /models/gpt-oss-120b | pp2048 | 4937.15 ± 28.36 | | 416.25 ± 2.38 | 414.83 ± 2.38 | 465.72 ± 2.32 |
| /models/gpt-oss-120b | tg128 | 61.89 ± 0.06 | 62.80 ± 0.40 | | | |
| /models/gpt-oss-120b | pp8192 | 6813.77 ± 24.33 | | 1203.71 ± 4.28 | 1202.29 ± 4.28 | 1254.79 ± 4.35 |
| /models/gpt-oss-120b | tg32 | 59.79 ± 0.12 | 61.93 ± 0.12 | | | |
| /models/gpt-oss-120b | pp8192 | 6802.16 ± 21.47 | | 1205.76 ± 3.80 | 1204.33 ± 3.80 | 1256.88 ± 3.92 |
| /models/gpt-oss-120b | tg128 | 60.06 ± 0.05 | 61.00 ± 0.00 | | | |
Do these performance numbers look reasonable for a single DGX Spark with DGX OS 7.4 and the 590.48.01 driver?
Looks good
I finally got around to trying a PR upstream. Wish me luck.