We have so many posts about vLLM now that I decided to make a new one regarding FP4 quants. As of today, FP4 is not properly utilized in current vLLM builds on our hardware, so you lose a lot of performance by picking NVFP4 quants over AWQ 4-bit ones.
Here is a comparison between Qwen3-VL-235B-A22B in NVFP4 quantization and AWQ 4-bit on my dual DGX Spark cluster, using Friday's version of the vLLM main branch (my own Docker build). I retested one model with today's version, and the performance was the same. I asked Gemini to provide a short summary of the comparison, which you can see below. I'll also post the raw data in my first comment.
However, FP8 and AWQ 8-bit perform at roughly the same level, with FP8 a bit faster at prompt processing and AWQ 8-bit slightly ahead on token generation. I'm not posting those results here, as I botched a few tests and am not sure the prompt-processing numbers are correct.
As for FP4, I tested multiple models, and they all follow the same pattern, on both the cluster and a single machine.
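For anyone who wants a rough independent check of TTFT/ITL-style numbers against their own vLLM endpoint without the full vllm bench harness, here is a minimal sketch using the OpenAI-compatible API. The endpoint, port, and model id are placeholders - adjust to your deployment:

```python
# Rough TTFT / inter-token timing against a vLLM OpenAI-compatible endpoint.
# base_url, api_key and model id below are placeholders, not the exact setup from this post.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stamps = []
stream = client.chat.completions.create(
    model="QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",  # placeholder model id
    messages=[{"role": "user", "content": "Describe this benchmark in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        stamps.append(time.perf_counter())

ttft = stamps[0] - start
itl = [(b - a) * 1000 for a, b in zip(stamps, stamps[1:])]
print(f"TTFT: {ttft * 1000:.1f} ms, mean ITL: {sum(itl) / len(itl):.1f} ms")
```

This only measures a single streamed request, so it won't reproduce the concurrency-10 numbers below, but it is enough to see whether two quants behave very differently on your own box.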
Gemini summary:
Based on the benchmark logs provided from your DGX Spark cluster, here is the comparison between the RedHatAI (NVFP4) and QuantTrio (AWQ) quantizations of the Qwen3-VL-235B model.
Summary of Findings
The QuantTrio (AWQ) quantization consistently outperforms the RedHatAI (NVFP4) model across all metrics in both low (1 request) and high (10 concurrent requests) concurrency scenarios.
Throughput: The AWQ model demonstrates significantly higher output token generation speeds, running roughly 32% faster at single concurrency and 18% faster at high concurrency.
Latency: The AWQ model provides a snappier initial response (Time to First Token) and faster subsequent token generation (Inter-token Latency), making it the superior choice for interactive applications.
Scalability: Both models see degradation in latency as concurrency increases (as expected), but the AWQ model handles the load with less performance penalty than the NVFP4 version.
Comparison Table
The following table compares the key metrics extracted from your vllm bench runs.
| Metric | Concurrency | RedHatAI (NVFP4) | QuantTrio (AWQ) | Delta (AWQ vs NVFP4) |
|---|---|---|---|---|
| Output Tokens/s | 1 | 18.91 | 24.93 | +31.8% (Faster) |
| Output Tokens/s | 10 | 35.58 | 42.11 | +18.3% (Faster) |
| Request Throughput (req/s) | 1 | 0.16 | 0.21 | +31.2% |
| Request Throughput (req/s) | 10 | 0.14 | 0.17 | +21.4% |
| Mean Time to First Token (ms) | 1 | 199.66 | 170.23 | -14.7% (Faster) |
| Mean Time to First Token (ms) | 10 | 1049.56 | 1009.22 | -3.8% (Faster) |
| Mean Inter-Token Latency (ms) | 1 | 51.62 | 39.01 | -24.4% (Faster) |
| Mean Inter-Token Latency (ms) | 10 | 106.38 | 90.07 | -15.3% (Faster) |
Detailed Observations
Single Request Performance:
At a single concurrent request, the AWQ model is significantly more efficient. The Inter-Token Latency (ITL) drops from ~51ms (NVFP4) to ~39ms (AWQ). This results in a much smoother generation experience for a single user.
Concurrency Scaling:
When ramping up to 10 concurrent requests, the NVFP4 model struggles slightly more than the AWQ model. While both models see a jump in Time to First Token (TTFT) due to queuing/scheduling (rising from ~180ms to over 1000ms), the AWQ model maintains a higher total token throughput (42.11 tok/s vs 35.58 tok/s), indicating better utilization of the DGX GPU resources under load.
Recommendation:
Unless there is a specific accuracy requirement that strictly demands the NVFP4 quantization format, the QuantTrio AWQ build is the more performant choice for this specific hardware configuration and workload.
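Since the summary and the percentages above came from Gemini, here is a quick recomputation of the deltas from the raw means in the table, plus a consistency check that the single-concurrency throughput lines up with the inter-token latency (all numbers copied from the table above):

```python
# Recompute the table's deltas from the raw means (NVFP4, AWQ), copied from the table above.
rows = {
    "Output tok/s (c=1)":  (18.91, 24.93),
    "Output tok/s (c=10)": (35.58, 42.11),
    "Requests/s (c=1)":    (0.16, 0.21),
    "Requests/s (c=10)":   (0.14, 0.17),
    "Mean TTFT ms (c=1)":  (199.66, 170.23),
    "Mean TTFT ms (c=10)": (1049.56, 1009.22),
    "Mean ITL ms (c=1)":   (51.62, 39.01),
    "Mean ITL ms (c=10)":  (106.38, 90.07),
}
for name, (nvfp4, awq) in rows.items():
    print(f"{name}: {100 * (awq / nvfp4 - 1):+.1f}% (AWQ vs NVFP4)")

# Consistency check: at concurrency 1, decode speed is roughly 1000 / ITL(ms),
# which should land close to the reported output tokens/s.
print(1000 / 51.62, 1000 / 39.01)  # ~19.4 and ~25.6 tok/s vs 18.91 and 24.93 reported;
                                   # the small gap is mostly the TTFT share of each request.
```

The recomputed percentages match the table, so the deltas are at least internally consistent with the raw means.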
This is interesting. I've seen some other posts using RTX6000, and their findings were similar to mine.
On the other hand, even though both are Blackwell, GB10 and RTX6000 are slightly different architectures - RTX6000 is sm120 and GB10 is sm121. It looks like the FP4 kernels are not optimized for sm121 yet.
Within vLLM, there are separate optimizations and kernels for Mixture-of-Experts (MoE) models like Qwen3-VL-235B-A22B and dense models like Llama 3.3 70B. So it would not be surprising to see a difference in performance across quantization formats when comparing MoE vs dense models, especially for newer formats like NVFP4.
I point this out because it may explain why one tester gets different results with dense models than what is seen here with MoE models. The two paths are not yet equally optimized for NVFP4 in vLLM and the lower-level libraries.
Since the Spark released, it's become apparent that many of the frameworks and libraries didn't support ARM, so there's a period of catch-up happening.
My question is, at what level is the incompatibility?
Is it the libraries themselves (vLLM, PaddlePaddle, etc.), or a framework they all depend upon, or both?
I think ARM support has been good for a while. There is nothing special about the Spark's ARM processor - it is a standard aarch64 architecture that has been around long enough that all major packages/libraries build on it just fine.
It's more about GPU-specific support. It is Blackwell, but a consumer-level Blackwell, and unlike the 5090 and RTX 6000 Pro, which share sm120, it has its own arch code - sm121. So many libraries (including mainline PyTorch) don't know about sm121 yet. It will come, but it will take some time - hopefully less than it took for sm120.
True, but it looks like some libraries treat sm121 as different from sm120 and don't include Blackwell-specific code in their builds. Even PyTorch complains about it (not sure if it affects anything, though).
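If you want to see what your GPU reports versus what your PyTorch build was compiled for, a quick check with standard torch APIs (nothing Spark-specific assumed):

```python
# Compare the device's compute capability with the arch list baked into this PyTorch build.
import torch

print(torch.cuda.get_device_name(0))
print("device capability:", torch.cuda.get_device_capability(0))  # GB10 reports (12, 1), i.e. sm121
print("arches in this build:", torch.cuda.get_arch_list())        # e.g. ['sm_90', 'sm_100', 'sm_120', ...]
# If the device's capability isn't in the build's arch list, PyTorch falls back to the
# nearest compatible kernels/PTX and prints a compatibility warning like the one mentioned above.
```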
It depends. If the model was post-trained in FP4 (similar to gpt-oss and some Nemotron models), then NVFP4 will be a "native" quant and will have better accuracy than a BF16 model quantized to 4-bit AWQ.
On the other hand, if both are quantized from BF16, it depends on whether w4a4 or w4a16 was used. AWQ quants are normally w4a16 - 4-bit weights with 16-bit activations. Most NVFP4 quants in the wild that I've seen are w4a4, so in theory AWQ should have better accuracy. NVFP4 w4a16 should be more accurate than AWQ, though.
Also, not every quantization process is the same; it all depends on the calibration dataset, etc.
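If you want to check whether a given repo is w4a4 or w4a16 without downloading the weights, the quantization config in its config.json usually tells you. A rough sketch - the repo ids are placeholders and the exact fields vary between AWQ and compressed-tensors/NVFP4 toolchains, so this just dumps whatever is there:

```python
# Peek at a repo's quantization_config to see weight/activation bit-widths.
import json
from huggingface_hub import hf_hub_download

for repo in [
    "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",  # placeholder repo ids --
    "QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ",   # substitute the ones you actually benchmarked
]:
    cfg = json.load(open(hf_hub_download(repo, "config.json")))
    qc = cfg.get("quantization_config", {})
    print(repo, "->", qc.get("quant_method"))
    # AWQ repos typically expose something like {"quant_method": "awq", "bits": 4, ...} (w4a16),
    # while compressed-tensors NVFP4 repos list per-group "weights" and "input_activations"
    # entries whose num_bits show whether the quant is w4a4 or w4a16.
    print(json.dumps(qc, indent=2)[:800])
```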
It's somewhat comical watching people scramble to figure out how to determine hardware capability with sm120, sm121, sm100 and their respective "a" and "f" variant build flags.
To their credit, it's a total Rube Goldberg machine on top of an already incredibly delicate/nuanced process.
Still getting the autotuner debug output on launch, but the crashing/illegal-instruction errors are gone. Going to try some speed/accuracy tests a little later, but so far this looks really promising.
Edit: 20 minutes after posting this, it crashed. Never mind. Wish I had a definitive reproduction, but it's just time/benchmarking that makes it appear :(