Previously, I released a working containerization of vLLM to run Qwen3-Next-80B-A3B-NVFP4. It ran at a very usable 20 tokens/second. However, with more tinkering, I was able to nearly double the performance to 35 tokens/second.
Slow compared to where it will be once support comes out. Getting NVFP4 to work now is an investment in the future. Also, it's nice having a much smaller model in memory.
I do remember getting 40-ish on 8-bit quants, like DevQuasar's image. Thus, this will be blazing fast once there is official "prime time" support.
I mean, in Qwen3-Next's case it doesn't make much sense to lose accuracy AND speed at the same time.
Although, recompiling flashinfer from source with the 12.1f arch (or 12.1a if you don't care about future compatibility) is a good idea. I don't know what flags they use by default, but in theory, if they use 12.0f during the build, it should fall back to ptxas to recompile things at runtime when it encounters a 12.1 device.
I'll try to include flashinfer builds in our community Docker builds - if it helps avoid crashes with certain NVFP4 quants, that would be good. So thanks for the pointers!
Also, as an FYI, the 'f' suffix in the arch code (e.g. 12.1f) doesn't mean flash-attention features are enabled; it just means the compiler will produce code that can run on the entire arch family. The 'a' suffix targets a specific arch and enables that arch's unique features, if any - see section 5.1 "Compute Capabilities" in the CUDA Programming Guide for details.
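To make the distinction concrete, here's roughly how the two suffixes look in an nvcc invocation (a sketch, assuming a CUDA 13 toolchain that accepts the sm_121a/sm_121f targets; kernel.cu is just a placeholder source file):

```bash
# 'a' suffix: arch-specific build, unlocks sm_121-only features,
# but the resulting binary runs ONLY on sm_121 (GB10).
nvcc -arch=sm_121a -c kernel.cu -o kernel_121a.o

# 'f' suffix: family build, no arch-exclusive features,
# but the binary stays compatible across the 12.x family.
# (This is what "12.1a" vs "12.1f" in FLASHINFER_CUDA_ARCH_LIST maps to.)
nvcc -arch=sm_121f -c kernel.cu -o kernel_121f.o
```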
Funny thing is that if you ask any LLM about it, it will start telling you about MXFP8, etc., because people were speculating in blogs. Always good to double-check with the source.
Just out of curiosity, I decided to run this model on my container built from pre-built vLLM nightly wheels (from earlier today) and the flashinfer 0.6.0rc2 pre-release (flashinfer-python, flashinfer-cubin, flashinfer-jit-cache), and I'm getting the same 35 t/s:
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 3.41
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.29
Output token throughput (tok/s): 34.93
Peak output token throughput (tok/s): 36.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 38.45
---------------Time to First Token----------------
Mean TTFT (ms): 71.15
Median TTFT (ms): 71.15
P99 TTFT (ms): 71.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 28.26
Median TPOT (ms): 28.26
P99 TPOT (ms): 28.26
---------------Inter-token Latency----------------
Mean ITL (ms): 28.26
Median ITL (ms): 28.19
P99 ITL (ms): 30.22
==================================================
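For reference, the numbers above come from vLLM's own serving benchmark; the invocation was roughly along these lines (a sketch - the flags and lengths are illustrative, not the exact command):

```bash
# Single-stream benchmark against an already-running server (sketch)
vllm bench serve \
  --model RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --dataset-name random \
  --random-input-len 12 \
  --random-output-len 120 \
  --num-prompts 1
```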
Launching simply with the following command on a single Spark:
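A minimal single-GPU launch of that sort looks something like this (a sketch, not the exact invocation - the model path is the RESMP-DEV NVFP4 quant discussed in this thread, and the flags are illustrative):

```bash
# Illustrative single-Spark launch; tune context length and memory utilization to taste
vllm serve RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85
```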
No idea why you were getting slower speeds before, though.
Maybe there were some improvements between 0.5.3 and 0.6.0rc2, or maybe bumping the vLLM version did the trick, but it's definitely not from compiling flashinfer from source.
Or maybe you had flashinfer-jit-cache installed from cu129 wheels.
BTW, the only place where FLASHINFER_CUDA_ARCH_LIST really matters is when you build the flashinfer-jit-cache package, because it's the only one that actually compiles things with nvcc. flashinfer-python is a pure Python package. There is also "flashinfer-cubin", which ships prebuilt kernel binaries, but those are downloaded rather than compiled.
If flashinfer-jit-cache is missing, it will just compile the relevant code on the first launch of the model. EDIT: it does, but it fails with OOM. I'm rebuilding it from source to see if it makes any difference compared to the cu130 wheels.
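For anyone trying the same rebuild, the shape of it is roughly this (a sketch - check the flashinfer repo for the exact packaging steps of your release):

```bash
# Pin the arch for any step that actually invokes nvcc
# (the flashinfer-jit-cache AOT build, or the JIT compile on first model launch).
export FLASHINFER_CUDA_ARCH_LIST="12.1f"   # family-wide; use "12.1a" for GB10-specific features

git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
# Build/install the flashinfer-jit-cache package from here per the repo's instructions;
# a plain source install only gives you the pure-Python flashinfer-python package.
```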
I can confirm similar observations on my DGX Spark (GB10). Currently, I am hitting a wall at 35 tps (single stream) for NVFP4 and 44 tps for FP8.
For a pure NVFP4 execution on Blackwell, this seems way too low. My goal is high-throughput RAG with massive parallelism. Running a stress test with 100 concurrent batches, I am capping out at ~680 system tps.
For comparison on the same hardware:
- GPT-OSS:120B (MXFP4): reaches ~1300 system tps.
- Qwen3-30B-A3B: reaches ~1700 system tps (though quality is too low for me).
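The 100-concurrent stress test can be approximated with vLLM's own serving benchmark by forcing the concurrency level; something like this (a sketch - my actual harness differs, and the lengths shown are illustrative):

```bash
# ~100 concurrent streams against a running server (sketch)
vllm bench serve \
  --model RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --dataset-name random \
  --random-input-len 200 \
  --random-output-len 150 \
  --num-prompts 500 \
  --max-concurrency 100
```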
The logs clearly indicate that we are not running native NVFP4 yet. `vllm` is falling back to **Cutlass** instead of using **FlashInfer**, which likely explains the performance gap.
vllm-nvfp4-opt | (Worker pid=161) INFO 01-06 07:29:03 [gpu_model_runner.py:3702] Starting to load model RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4...
vllm-nvfp4-opt | (Worker pid=161) WARNING 01-06 07:29:04 [compressed_tensors.py:742] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
vllm-nvfp4-opt | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_w4a4_nvfp4.py:63] Using cutlass for NVFP4 GEMM
vllm-nvfp4-opt | (Worker pid=161) WARNING 01-06 07:29:04 [nvfp4_moe_support.py:47] FlashInfer kernels unavailable for CompressedTensorsW4A4Nvfp4MoEMethod on current platform.
vllm-nvfp4-opt | (Worker pid=161) INFO 01-06 07:29:04 [compressed_tensors_moe.py:253] Using Cutlass for CompressedTensorsW4A4Nvfp4MoEMethod.
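A quick way to check which path your own container picked is to grep the startup logs (the container name here matches the logs above; adjust to yours):

```bash
# Shows whether the NVFP4 GEMM / MoE layers went to Cutlass or FlashInfer
docker logs vllm-nvfp4-opt 2>&1 | grep -Ei "nvfp4|cutlass|flashinfer"
```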
Total Tokens: 15000
Total Time: 21.94 Seconds
System Throughput: 683.70 tokens/s (Aggregate)
Avg/User: 6.84 tokens/s
If we can unlock the native FlashInfer kernels for the sm_121 (GB10) architecture, I expect another 25-30% boost. As mentioned by @eugr, the target should be around 80 tps (single) and close to 1700 tps (parallel system throughput) to match the theoretical NVFP4 efficiency.
Has anyone managed to force a clean FlashInfer build that vLLM accepts without falling back to Cutlass?
Cutlass is the correct flashinfer kernel here, as the TRT_LLM ones are only supported on sm100 so far.
I compiled flashinfer from source targeting the sm121 arch - there is no performance difference compared to the prebuilt cu130 wheels.
Partially it's a vLLM issue, as it has some logic that skips certain optimizations on sm121, but just forcing them didn't work - it seems to be a bit more involved than that. I haven't looked any further. I've noticed there were a few flashinfer-related fixes in vLLM recently which at least fixed the crashes - maybe I should try patching it again and see if it works this time.
BTW, I never paid attention to this, but apparently Qwen3-Next models don't support prefix caching.
This means that if your workloads use multi-turn conversations (chat, coding), performance will suffer significantly, as the entire conversation history has to be reprocessed on each request.
I found this out by running my new benchmarking tool and getting pretty abysmal results with 0% cache utilization in the vLLM inference logs. I went to the vLLM startup logs and found this message:
Hybrid or mamba-based model detected without support for prefix caching: disabling
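If you want to check your own setup, the same startup log is the quickest place to look (a sketch - set the container name to yours):

```bash
# Prints the warning above if prefix caching got silently disabled for the model
VLLM_CONTAINER=my-vllm-container   # adjust to your container name
docker logs "$VLLM_CONTAINER" 2>&1 | grep -i "prefix caching"
```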
And with a batch size of 64 and inputs of 50 to 250 tokens, I got to around 1300 tps, max 1385. But due to my RAG use case, I'm more interested in fast parallel responses for inputs randomly sized between 1k and 8k tokens; there I get to 200-400 tps.