From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f

Previously, I released a working containerization of vLLM to run Qwen3-Next-80B-A3B-NVFP4. It ran at a very usable 20 tokens/second. However, with more tinkering, I was able to nearly double the performance to 35 tokens/second.

Medium article on the nuances of getting it to work (I had to compile FlashInfer using 12.1f, NOT 12.0): https://blog.thomaspbraun.com/from-20-to-35-tokens-second-optimizing-nvfp4-inference-on-blackwell-gb10-306a84bff467/

Example run:

docker run -d --gpus all --ipc=host -p 8001:8888 -v ~/.cache/huggingface:/root/.cache/huggingface -e MODEL=RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 --name vllm-nvfp4 avarok/vllm-dgx-spark:v14 serve
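
Assuming the image's `serve` entrypoint brings up vLLM's OpenAI-compatible API on port 8888 inside the container (mapped to 8001 above), a quick smoke test would be something like:

curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'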

Still slow. NVFP4 support is not ready for prime time on DGX Spark yet.


I was getting 44 t/s out of the FP8 version on a single Spark. 4-bit quants should be almost 2x faster.

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8000 \
  --num-prompts 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.79
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.36
Output token throughput (tok/s):         42.61
Peak output token throughput (tok/s):    44.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          46.90
---------------Time to First Token----------------
Mean TTFT (ms):                          131.13
Median TTFT (ms):                        131.13
P99 TTFT (ms):                           131.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.55
Median TPOT (ms):                        22.55
P99 TPOT (ms):                           22.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.55
Median ITL (ms):                         22.38
P99 ITL (ms):                            25.39
==================================================

Slow compared to where it will be once support comes out. Getting NVFP4 to work now is an investment in the future. Also, it’s nice having a much smaller model in memory.

I do remember getting 40-ish on 8-bit quants, like DevQuasar’s image. Thus, this will be blazing fast once there is official “prime time” support.

I mean, in the Qwen3-Next case it doesn't make much sense to lose accuracy AND speed at the same time.

Recompiling flashinfer from source with the 12.1f arch (or 12.1a if you don't care about future compatibility) is a good idea, though. I don't know what flags they use by default, but in theory, if they use 12.0f during the build, it should use ptxas to recompile stuff when it encounters 12.1a.
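
In case it helps, a from-source build along those lines might look roughly like this (the FLASHINFER_CUDA_ARCH_LIST variable comes up later in this thread; the "12.1f" value format and the pip invocation are assumptions, so double-check against the FlashInfer build docs):

git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
# "12.1f" = family-wide code, "12.1a" = arch-specific code for GB10 only (see below)
export FLASHINFER_CUDA_ARCH_LIST="12.1f"
pip install --no-build-isolation -v .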

I'll try to include flashinfer builds in our community Docker builds - if it helps avoid crashes with certain NVFP4 quants, that would be good. So thanks for the pointers!

Also, as an FYI, the “f” suffix in an arch code (e.g. 12.1f) doesn't mean enabling flash-attention features; it just means the compiler will produce code that can run on the entire arch family. The “a” suffix targets a specific arch and enables that arch's unique features, if any - see “5.1. Compute Capabilities” in the CUDA Programming Guide for details.
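
Roughly, in nvcc terms it's the difference between the following two targets (the exact spelling of these target names on current CUDA is an assumption - see the Compute Capabilities section linked above):

nvcc -arch=sm_121f -c kernel.cu -o kernel_family.o   # "f": code portable across the arch family
nvcc -arch=sm_121a -c kernel.cu -o kernel_arch.o     # "a": sm_121 only, arch-specific features enabled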

Funny thing is that if you ask any LLM about it, it will start telling you about MXFP8, etc., because people were speculating in blogs. Always good to double-check with the source.

Just out of curiosity, I decided to run this model on my container built from pre-built vLLM nightly wheels (from earlier today) and the flashinfer 0.6.0rc2 pre-release (flashinfer-python, flashinfer-cubin, flashinfer-jit-cache), and I'm getting the same 35 t/s:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.41
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.29
Output token throughput (tok/s):         34.93
Peak output token throughput (tok/s):    36.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          38.45
---------------Time to First Token----------------
Mean TTFT (ms):                          71.15
Median TTFT (ms):                        71.15
P99 TTFT (ms):                           71.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.26
Median TPOT (ms):                        28.26
P99 TPOT (ms):                           28.26
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.26
Median ITL (ms):                         28.19
P99 ITL (ms):                            30.22
==================================================

Launching simply with the following command on a single Spark:

vllm serve RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 32768 \
  --load-format fastsafetensors

Flashinfer versions:

flashinfer-cubin==0.6.0rc2
flashinfer-jit-cache==0.6.0rc2+cu130
flashinfer-python==0.6.0rc2
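
If you want to compare against your own container, a generic way to check which flashinfer pieces and versions are actually installed (assuming the package exposes __version__):

pip list 2>/dev/null | grep -i flashinfer
python -c "import flashinfer; print(flashinfer.__version__)"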

No idea why you were getting slower speeds before, though.

Maybe there were some improvements between 0.5.3 and 0.6.0rc2, or maybe bumping the vLLM version did the trick, but it's definitely not compiling flashinfer from source that made the difference.

Or maybe you had flashinfer-jit-cache installed from the cu129 wheels.

BTW, the only place where FLASHINFER_CUDA_ARCH_LIST really matters is when you build the flashinfer-jit-cache package, because that's the only one that actually compiles things with nvcc. flashinfer-python is a pure Python package. There is also flashinfer-cubin, which includes kernel definitions, but those are downloaded rather than compiled locally.

If flashinfer-jit-cache is missing, it will just compile the relevant code on the first launch of the model. EDIT: it does, but it fails with OOM. I'm rebuilding it from source to see if it makes any difference compared to the one from the cu130 wheels.

I can confirm similar observations on my DGX Spark (GB10). Currently, I am hitting a wall at 35 tps (single stream) for NVFP4 and 44 tps for FP8.

For a pure NVFP4 execution on Blackwell, this seems way too low. My goal is high-throughput RAG with massive parallelism. Running a stress test with 100 concurrent batches, I am capping out at ~680 system tps.

For comparison on the same hardware:

  • GPT-OSS:120B (MXFP4): Reaches ~1300 system tps.

  • Qwen3-30B-A3B: Reaches ~1700 system tps (though quality is too low for me).

The logs clearly indicate that we are not running the native FlashInfer NVFP4 path yet: `vllm` is falling back to **Cutlass** instead of using **FlashInfer**, which likely explains the performance gap.
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:03 [gpu_model_runner.py:3702] Starting to load model RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4...
vllm-nvfp4-opt  | (Worker pid=161) WARNING 01-06 07:29:04 [compressed_tensors.py:742] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_w4a4_nvfp4.py:63] Using cutlass for NVFP4 GEMM
vllm-nvfp4-opt  | (Worker pid=161) WARNING 01-06 07:29:04 [nvfp4_moe_support.py:47] FlashInfer kernels unavailable for CompressedTensorsW4A4Nvfp4MoEMethod on current platform.
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_moe.py:253] Using Cutlass for CompressedTensorsW4A4Nvfp4MoEMethod.
Total Tokens:       15000
Total Time:         21.94 Seconds
System Throughput:  683.70 tokens/s (Aggregate)
Avg/User:           6.84 tokens/s
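
For anyone checking their own setup, a quick way to see which NVFP4 backend vLLM picked is to grep the startup logs (container name taken from the log excerpt above):

docker logs vllm-nvfp4-opt 2>&1 | grep -Ei "nvfp4|cutlass|flashinfer"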

If we can unlock the native FlashInfer kernels for the sm_121 (GB10) architecture, I expect another 25-30% boost. As mentioned by @eugr, the target should be around 80 tps (single stream) and close to 1700 tps (parallel system throughput) to match the theoretical NVFP4 efficiency.

Has anyone managed to force a clean FlashInfer build that vLLM accepts without falling back to Cutlass?

How are you hitting 1300 tps at 100x concurrency with GPT-OSS-120b? With vLLM I max out at 800 tps at around 200 requests.

Cutlass is a correct flashinfer kernel here, as TRT_LLM ones are only supported on sm100 so far.

I compiled Flashinfer from source targeting the sm121 arch - there is no performance difference from the prebuilt cu130 builds.

Partially it's a vLLM issue, as it has some logic that skips certain optimizations on sm121, but just forcing them didn't work - it seems to be a bit more involved than that. I haven't looked any further. I've noticed there were a few flashinfer-related fixes in vLLM recently which at least fixed the crashes - maybe I should try patching it again and see if it works this time.


BTW, I never paid attention to this, but apparently Qwen3-Next models don’t support prefix caching.

It means that if your workloads use multi-turn conversation (chat, coding), it will significantly affect the performance, as it will have to reprocess the entire conversation history on each request.

I found it by running my new benchmarking tool, getting pretty abysmal results, and seeing 0% cache utilization in the vLLM inference logs. I then went to the vLLM startup logs and found this message:

Hybrid or mamba-based model detected without support for prefix caching: disabling

Here is llama-benchy output:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888 --enable-prefix-caching
uv run llama-benchy --base-url http://spark:8888/v1 --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --depth 0 4096 8192 16384 32768 --latency-mode generation --enable-prefix-caching
model: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

| test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|
| pp2048 | 3078.07 ± 1043.29 | 845.90 ± 259.80 | 750.74 ± 259.80 | 845.94 ± 259.80 |
| tg32 | 44.68 ± 0.09 | | | |
| ctx_pp @ d4096 | 4161.83 ± 68.28 | 1079.61 ± 16.30 | 984.45 ± 16.30 | 1079.63 ± 16.30 |
| ctx_tg @ d4096 | 44.02 ± 0.07 | | | |
| pp2048 @ d4096 | 1334.83 ± 6.93 | 1629.48 ± 7.95 | 1534.32 ± 7.95 | 1629.51 ± 7.96 |
| tg32 @ d4096 | 43.81 ± 0.05 | | | |
| ctx_pp @ d8192 | 3917.16 ± 19.75 | 2186.43 ± 10.63 | 2091.28 ± 10.63 | 2186.47 ± 10.63 |
| ctx_tg @ d8192 | 43.29 ± 0.06 | | | |
| pp2048 @ d8192 | 768.54 ± 3.42 | 2760.00 ± 11.82 | 2664.84 ± 11.82 | 2760.04 ± 11.81 |
| tg32 @ d8192 | 42.81 ± 0.17 | | | |
| ctx_pp @ d16384 | 3657.50 ± 25.56 | 4574.94 ± 31.17 | 4479.78 ± 31.17 | 4574.98 ± 31.18 |
| ctx_tg @ d16384 | 41.42 ± 0.10 | | | |
| pp2048 @ d16384 | 402.06 ± 1.93 | 5189.02 ± 24.40 | 5093.86 ± 24.40 | 5189.05 ± 24.40 |
| tg32 @ d16384 | 40.99 ± 0.13 | | | |
| ctx_pp @ d32768 | 3357.35 ± 3.55 | 9855.25 ± 10.31 | 9760.10 ± 10.31 | 9855.29 ± 10.31 |
| ctx_tg @ d32768 | 38.55 ± 0.03 | | | |
| pp2048 @ d32768 | 195.82 ± 0.35 | 10553.98 ± 18.55 | 10458.82 ± 18.55 | 10554.00 ± 18.55 |
| tg32 @ d32768 | 38.25 ± 0.04 | | | |

llama-benchy (0.1.1.dev1+g7646c3141.7646c3141)

Notice how slow prefill is for follow-up prompts (after the ctx-load phase), especially as the context grows: at a 32k-token depth, pp2048 drops to ~196 t/s, so every new 2048-token prompt waits over 10 seconds before the first token.

@trystan1 this is my docker compose setup:

services:
  # ===============================
  # vLLM – GPT-OSS-120B (stays as is)
  # ===============================
  vllm-mxfp4:
    container_name: vllm-mxfp4
    image: nvcr.io/nvidia/vllm:25.09-py3

    command: >
      vllm serve openai/gpt-oss-120b
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --quantization mxfp4
      --max-model-len 32768
      --max-num-seqs 8

    ports:
      - "8000:8000"

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./harmony:/harmony:ro

    environment:
      HF_TOKEN: ${HF_TOKEN}
      TIKTOKEN_ENCODINGS_BASE: /harmony
      CUDA_VISIBLE_DEVICES: "0"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

    ipc: host
    shm_size: "64gb"
    restart: unless-stopped
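
To bring the service up, standard compose usage (service name from the file above):

docker compose up -d vllm-mxfp4
docker compose logs -f vllm-mxfp4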

With a batch size of 64 and inputs of 50 to 250 tokens I got to around 1300 tps, max 1385. But for my RAG I'm more interested in fast parallel responses for inputs of random sizes between 1k and 8k tokens, and there I only get 200-400 tps.