From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f

Previously, I released a working containerization of vLLM to run Qwen3-Next-80B-A3B-NVFP4. It ran at a very usable 20 tokens/second. However, with more tinkering, I was able to nearly double the performance to 35 tokens/second.

Medium article on the nuances of getting it to work (I had to compile FlashInfer using 12.1f, NOT 12.0): https://blog.thomaspbraun.com/from-20-to-35-tokens-second-optimizing-nvfp4-inference-on-blackwell-gb10-306a84bff467/

Example run:

docker run -d --gpus all --ipc=host -p 8001:8888 -v ~/.cache/huggingface:/root/.cache/huggingface -e MODEL=RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 --name vllm-nvfp4 avarok/vllm-dgx-spark:v14 serve
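
Assuming the image's `serve` entrypoint brings up vLLM's OpenAI-compatible API on port 8888 inside the container (mapped to 8001 above), a quick smoke test would be something like:

curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'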

Still slow. NVFP4 support is not ready for prime time on DGX Spark yet.


I was getting 44 t/s out of the FP8 version on a single Spark. 4-bit quants should be almost 2x faster.

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8000 \
  --num-prompts 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.79
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.36
Output token throughput (tok/s):         42.61
Peak output token throughput (tok/s):    44.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          46.90
---------------Time to First Token----------------
Mean TTFT (ms):                          131.13
Median TTFT (ms):                        131.13
P99 TTFT (ms):                           131.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.55
Median TPOT (ms):                        22.55
P99 TPOT (ms):                           22.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.55
Median ITL (ms):                         22.38
P99 ITL (ms):                            25.39
==================================================

Slow compared to where it will be once support comes out. Getting NVFP4 to work now is an investment in the future. Also, it’s nice having a much smaller model in memory.

I do remember getting 40-ish on 8-bit quants, like DevQuasar’s image. Thus, this will be blazing fast once there is official “prime time” support.

I mean, in the Qwen3-Next case it doesn't make much sense to lose accuracy AND speed at the same time.

Recompiling flashinfer from source with the 12.1f arch (or 12.1a if you don't care about future compatibility) is a good idea, though. I don't know what flags they use by default, but in theory, if they use 12.0f during the build, it should use ptxas to recompile stuff when it encounters 12.1a.
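
In case it helps, a from-source build along those lines might look roughly like this (the FLASHINFER_CUDA_ARCH_LIST variable comes up later in this thread; the "12.1f" value format and the pip invocation are assumptions, so double-check against the FlashInfer build docs):

git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
# "12.1f" = family-wide code, "12.1a" = arch-specific code for GB10 only (see below)
export FLASHINFER_CUDA_ARCH_LIST="12.1f"
pip install --no-build-isolation -v .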

I'll try to include flashinfer builds in our community Docker builds - if it helps avoid crashes with certain NVFP4 quants, that would be good. So thanks for the pointers!

Also, as an FYI, the “f” suffix in an arch code (e.g. 12.1f) doesn't mean enabling flash-attention features; it just means the compiler will produce code that can run on the entire arch family. The “a” suffix targets a specific arch and enables that arch's unique features, if any - see “5.1. Compute Capabilities” in the CUDA Programming Guide for details.
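
Roughly, in nvcc terms it's the difference between the following two targets (the exact spelling of these target names on current CUDA is an assumption - see the Compute Capabilities section linked above):

nvcc -arch=sm_121f -c kernel.cu -o kernel_family.o   # "f": code portable across the arch family
nvcc -arch=sm_121a -c kernel.cu -o kernel_arch.o     # "a": sm_121 only, arch-specific features enabled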

Funny thing is that if you ask any LLM about it, it will start telling you about MXFP8, etc., because people were speculating in blogs. Always good to double-check with the source.

Just out of curiosity, I decided to run this model on my container built from pre-built vLLM nightly wheels (from earlier today) and the flashinfer 0.6.0rc2 pre-release (flashinfer-python, flashinfer-cubin, flashinfer-jit-cache), and I'm getting the same 35 t/s:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.41
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.29
Output token throughput (tok/s):         34.93
Peak output token throughput (tok/s):    36.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          38.45
---------------Time to First Token----------------
Mean TTFT (ms):                          71.15
Median TTFT (ms):                        71.15
P99 TTFT (ms):                           71.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.26
Median TPOT (ms):                        28.26
P99 TPOT (ms):                           28.26
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.26
Median ITL (ms):                         28.19
P99 ITL (ms):                            30.22
==================================================

Launching simply with the following command on a single Spark:

vllm serve RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 32768 \
  --load-format fastsafetensors

Flashinfer versions:

flashinfer-cubin==0.6.0rc2
flashinfer-jit-cache==0.6.0rc2+cu130
flashinfer-python==0.6.0rc2
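
If you want to compare against your own container, a generic way to check which flashinfer pieces and versions are actually installed (assuming the package exposes __version__):

pip list 2>/dev/null | grep -i flashinfer
python -c "import flashinfer; print(flashinfer.__version__)"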

No idea why you were getting slower speeds before, though.

Maybe there were some improvements between 0.5.3 and 0.6.0rc2, or maybe bumping the vLLM version did the trick, but it's definitely not compiling flashinfer from source that made the difference.

Or maybe you had flashinfer-jit-cache installed from the cu129 wheels.

BTW, the only place where FLASHINFER_CUDA_ARCH_LIST really matters is when you build the flashinfer-jit-cache package, because that's the only one that actually compiles things with nvcc. flashinfer-python is a pure Python package. There is also flashinfer-cubin, which includes kernel definitions, but those are downloaded rather than compiled locally.

If flashinfer-jit-cache is missing, it will just compile the relevant code on the first launch of the model. EDIT: it does, but it fails with OOM. I'm rebuilding it from source to see if it makes any difference compared to the one from the cu130 wheels.

I can confirm similar observations on my DGX Spark (GB10). Currently, I am hitting a wall at 35 tps (single stream) for NVFP4 and 44 tps for FP8.

For a pure NVFP4 execution on Blackwell, this seems way too low. My goal is high-throughput RAG with massive parallelism. Running a stress test with 100 concurrent batches, I am capping out at ~680 system tps.

For comparison on the same hardware:

  • GPT-OSS:120B (MXFP4): Reaches ~1300 system tps.

  • Qwen3-30B-A3B: Reaches ~1700 system tps (though quality is too low for me).

The logs clearly indicate that we are not running the native FlashInfer NVFP4 path yet: `vllm` is falling back to **Cutlass** instead of using **FlashInfer**, which likely explains the performance gap.
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:03 [gpu_model_runner.py:3702] Starting to load model RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4...
vllm-nvfp4-opt  | (Worker pid=161) WARNING 01-06 07:29:04 [compressed_tensors.py:742] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_w4a4_nvfp4.py:63] Using cutlass for NVFP4 GEMM
vllm-nvfp4-opt  | (Worker pid=161) WARNING 01-06 07:29:04 [nvfp4_moe_support.py:47] FlashInfer kernels unavailable for CompressedTensorsW4A4Nvfp4MoEMethod on current platform.
vllm-nvfp4-opt  | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_moe.py:253] Using Cutlass for CompressedTensorsW4A4Nvfp4MoEMethod.
Total Tokens:       15000
Total Time:         21.94 Seconds
System Throughput:  683.70 tokens/s (Aggregate)
Avg/User:           6.84 tokens/s
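
For anyone checking their own setup, a quick way to see which NVFP4 backend vLLM picked is to grep the startup logs (container name taken from the log excerpt above):

docker logs vllm-nvfp4-opt 2>&1 | grep -Ei "nvfp4|cutlass|flashinfer"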

If we can unlock the native FlashInfer kernels for the sm_121 (GB10) architecture, I expect another 25-30% boost. As mentioned by @eugr, the target should be around 80 tps (single stream) and close to 1700 tps (parallel system throughput) to match the theoretical NVFP4 efficiency.

Has anyone managed to force a clean FlashInfer build that vLLM accepts without falling back to Cutlass?

How are you hitting 1300 tps at 100x concurrency with GPT-OSS-120b? With vLLM I max out at 800 tps at around 200 requests.

Cutlass is a correct flashinfer kernel here, as TRT_LLM ones are only supported on sm100 so far.

I compiled Flashinfer from source targeting the sm121 arch - there is no performance difference from the prebuilt cu130 builds.

Partially it's a vLLM issue, as it has some logic that skips certain optimizations on sm121, but just forcing them didn't work - it seems to be a bit more involved than that. I haven't looked any further. I've noticed there were a few flashinfer-related fixes in vLLM recently which at least fixed the crashes - maybe I should try patching it again and see if it works this time.


BTW, I never paid attention to this, but apparently Qwen3-Next models don’t support prefix caching.

It means that if your workloads use multi-turn conversation (chat, coding), it will significantly affect the performance, as it will have to reprocess the entire conversation history on each request.

I found it by running my new benchmarking tool, getting pretty abysmal results, and seeing 0% cache utilization in the vLLM inference logs. I then went to the vLLM startup logs and found this message:

Hybrid or mamba-based model detected without support for prefix caching: disabling

Here is llama-benchy output:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888 --enable-prefix-caching
uv run llama-benchy --base-url http://spark:8888/v1 --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --depth 0 4096 8192 16384 32768 --latency-mode generation --enable-prefix-caching
model: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

| test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|
| pp2048 | 3078.07 ± 1043.29 | 845.90 ± 259.80 | 750.74 ± 259.80 | 845.94 ± 259.80 |
| tg32 | 44.68 ± 0.09 | | | |
| ctx_pp @ d4096 | 4161.83 ± 68.28 | 1079.61 ± 16.30 | 984.45 ± 16.30 | 1079.63 ± 16.30 |
| ctx_tg @ d4096 | 44.02 ± 0.07 | | | |
| pp2048 @ d4096 | 1334.83 ± 6.93 | 1629.48 ± 7.95 | 1534.32 ± 7.95 | 1629.51 ± 7.96 |
| tg32 @ d4096 | 43.81 ± 0.05 | | | |
| ctx_pp @ d8192 | 3917.16 ± 19.75 | 2186.43 ± 10.63 | 2091.28 ± 10.63 | 2186.47 ± 10.63 |
| ctx_tg @ d8192 | 43.29 ± 0.06 | | | |
| pp2048 @ d8192 | 768.54 ± 3.42 | 2760.00 ± 11.82 | 2664.84 ± 11.82 | 2760.04 ± 11.81 |
| tg32 @ d8192 | 42.81 ± 0.17 | | | |
| ctx_pp @ d16384 | 3657.50 ± 25.56 | 4574.94 ± 31.17 | 4479.78 ± 31.17 | 4574.98 ± 31.18 |
| ctx_tg @ d16384 | 41.42 ± 0.10 | | | |
| pp2048 @ d16384 | 402.06 ± 1.93 | 5189.02 ± 24.40 | 5093.86 ± 24.40 | 5189.05 ± 24.40 |
| tg32 @ d16384 | 40.99 ± 0.13 | | | |
| ctx_pp @ d32768 | 3357.35 ± 3.55 | 9855.25 ± 10.31 | 9760.10 ± 10.31 | 9855.29 ± 10.31 |
| ctx_tg @ d32768 | 38.55 ± 0.03 | | | |
| pp2048 @ d32768 | 195.82 ± 0.35 | 10553.98 ± 18.55 | 10458.82 ± 18.55 | 10554.00 ± 18.55 |
| tg32 @ d32768 | 38.25 ± 0.04 | | | |

llama-benchy (0.1.1.dev1+g7646c3141.7646c3141)

Notice how slow prefill is for follow-up prompts (after the ctx-load phase), especially as the context grows: at a 32k-token depth, pp2048 drops to ~196 t/s, so every new 2048-token prompt waits over 10 seconds before the first token.

@trystan1 this is my docker compose setup:

services:
  # ===============================
  # vLLM – GPT-OSS-120B (stays as is)
  # ===============================
  vllm-mxfp4:
    container_name: vllm-mxfp4
    image: nvcr.io/nvidia/vllm:25.09-py3

    command: >
      vllm serve openai/gpt-oss-120b
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --quantization mxfp4
      --max-model-len 32768
      --max-num-seqs 8

    ports:
      - "8000:8000"

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./harmony:/harmony:ro

    environment:
      HF_TOKEN: ${HF_TOKEN}
      TIKTOKEN_ENCODINGS_BASE: /harmony
      CUDA_VISIBLE_DEVICES: "0"

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

    ipc: host
    shm_size: "64gb"
    restart: unless-stopped
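
To bring the service up, standard compose usage (service name from the file above):

docker compose up -d vllm-mxfp4
docker compose logs -f vllm-mxfp4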

With a batch size of 64 and inputs of 50 to 250 tokens I got to around 1300 tps, max 1385. But for my RAG I'm more interested in fast parallel responses for inputs of random sizes between 1k and 8k tokens, and there I only get 200-400 tps.