PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

We have so many vLLM posts now that I decided to start a new one specifically about FP4 quants. As of today, FP4 is not properly utilized in current vLLM builds on our hardware, so you lose a lot of performance by picking NVFP4 quants over AWQ 4-bit ones.

Here is a comparison between Qwen3-VL-235B-A22B in NVFP4 quantization and AWQ 4-bit on my dual DGX Spark cluster, using Friday's build of the vLLM main branch (my Docker image). I retested one model with today's build, and the performance was the same. I asked Gemini for a short summary of the comparison, which you can see below. I'll also post the raw data in my first comment.

However, FP8 and AWQ 8-bit perform at roughly the same level, with FP8 a bit faster on prompt processing and AWQ 8-bit slightly ahead on token generation. I'm not posting those results here, as I botched a few tests and am not sure the prompt processing numbers are correct.

As for FP4, I tested multiple models, and they all show the same pattern, both on the cluster and on a single machine.

Gemini summary:

Based on the benchmark logs provided from your DGX Spark cluster, here is the comparison between the RedHatAI (NVFP4) and QuantTrio (AWQ) quantizations of the Qwen3-VL-235B model.

Summary of Findings

The QuantTrio (AWQ) quantization consistently outperforms the RedHatAI (NVFP4) model across all metrics in both low (1 request) and high (10 concurrent requests) concurrency scenarios.

  • Throughput: The AWQ model demonstrates significantly higher output token generation speeds, running roughly 32% faster at single concurrency and 18% faster at high concurrency.
  • Latency: The AWQ model provides a snappier initial response (Time to First Token) and faster subsequent token generation (Inter-token Latency), making it the superior choice for interactive applications.
  • Scalability: Both models see degradation in latency as concurrency increases (as expected), but the AWQ model handles the load with less performance penalty than the NVFP4 version.

Comparison Table

The following table compares the key metrics extracted from your vllm bench runs.

Metric               Concurrency   RedHatAI (NVFP4)   QuantTrio (AWQ)   Delta (AWQ vs NVFP4)
Output Tokens/s      1             18.91              24.93             +31.8% (faster)
Output Tokens/s      10            35.58              42.11             +18.3% (faster)
Request Throughput   1             0.16 req/s         0.21 req/s        +31.2%
Request Throughput   10            0.14 req/s         0.17 req/s        +21.4%
Mean TTFT (ms)       1             199.66             170.23            -14.7% (faster)
Mean TTFT (ms)       10            1049.56            1009.22           -3.8% (faster)
Mean ITL (ms)        1             51.62              39.01             -24.4% (faster)
Mean ITL (ms)        10            106.38             90.07             -15.3% (faster)

(TTFT = Time to First Token; ITL = Inter-Token Latency)

Detailed Observations

  1. Single Request Performance:
    At a single concurrent request, the AWQ model is significantly more efficient. The Inter-Token Latency (ITL) drops from ~51ms (NVFP4) to ~39ms (AWQ). This results in a much smoother generation experience for a single user.

  2. Concurrency Scaling:
    When ramping up to 10 concurrent requests, the NVFP4 model struggles slightly more than the AWQ model. While both models see a jump in Time to First Token (TTFT) due to queuing/scheduling (rising from ~180ms to over 1000ms), the AWQ model maintains a higher total token throughput (42.11 tok/s vs 35.58 tok/s), indicating better utilization of the DGX GPU resources under load.

  3. Recommendation:
    Unless there is a specific accuracy requirement that strictly demands the NVFP4 quantization format, the QuantTrio AWQ build is the more performant choice for this specific hardware configuration and workload.
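
As a quick sanity check, the percentage deltas in the table can be recomputed directly from the raw numbers in the "Serving Benchmark Result" blocks posted below (plain Python, values copied from the logs):

```python
def pct_delta(nvfp4, awq):
    """Relative change of the AWQ figure vs the NVFP4 figure, in percent."""
    return (awq / nvfp4 - 1.0) * 100.0

# (metric, concurrency, NVFP4 value, AWQ value) taken from the raw logs
rows = [
    ("Output tok/s", 1, 18.91, 24.93),
    ("Output tok/s", 10, 35.58, 42.11),
    ("Mean TTFT ms", 1, 199.66, 170.23),
    ("Mean ITL ms", 1, 51.62, 39.01),
    ("Mean ITL ms", 10, 106.38, 90.07),
]
for name, conc, nvfp4, awq in rows:
    print(f"{name} @ {conc}: {pct_delta(nvfp4, awq):+.1f}%")
```

(One nit: the 10-concurrency throughput delta actually rounds to +18.4%, not the quoted 18.3%.)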


Raw data

I ran 1 and then 10 concurrent requests for each model, in that order (to avoid all the prompt data being served from cache).

RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4

vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 32768
vllm bench serve   --backend vllm   --model RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4   --endpoint /v1/completions   --dataset-name sharegpt   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json   --num-prompts 1   --port 8888 --host spark
                            Output tokens per second
  25 +----------------------------------------------------------------------+
     |                                                                      |
     |                                                                      |
     |                  ******           ****************                   |
  20 |                **      ***********                *                  |
     |             ***                                    *                 |
     |           **                                        *                |
     |    *******                                           *               |
  15 |****                                                   *              |
     |                                                       *              |
     |                                                        *             |
  10 |                                                         *            |
     |                                                          *           |
     |                                                           *          |
     |                                                            *         |
   5 |                                                             **       |
     |                                                               **     |
     |                                                                 **   |
     |                                                                   ** |
   0 +----------------------------------------------------------------------+
     0         1         2         3          4         5         6         7

                          Concurrent requests per second
    1 +---------------------------------------------------------------------+
      |                                                            *        |
      |                                                            *        |
      |                                                             *       |
  0.8 |                                                             *       |
      |                                                              *      |
      |                                                              *      |
      |                                                               *     |
  0.6 |                                                               *     |
      |                                                                *    |
      |                                                                *    |
  0.4 |                                                                 *   |
      |                                                                 *   |
      |                                                                  *  |
      |                                                                  *  |
  0.2 |                                                                   * |
      |                                                                   * |
      |                                                                    *|
      |                                                                    *|
    0 +---------------------------------------------------------------------+
      0         1         2         3         4         5         6         7
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  6.29
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.16
Output token throughput (tok/s):         18.91
Peak output token throughput (tok/s):    21.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          20.82
---------------Time to First Token----------------
Mean TTFT (ms):                          199.66
Median TTFT (ms):                        199.66
P99 TTFT (ms):                           199.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.62
Median TPOT (ms):                        51.62
P99 TPOT (ms):                           51.62
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.62
Median ITL (ms):                         48.05
P99 ITL (ms):                            79.97
==================================================
                            Output tokens per second
  60 +----------------------------------------------------------------------+
     | **  *                                                                |
     | * * **                                                               |
  50 |*  ** ** **                                                           |
     |*      * **                                                           |
     |*      * * *  *         **                                            |
  40 |*       *  **** * **  * * *                                           |
     |*       *      ****** **   *                                          |
     |*       *      * * *** *   *                                          |
  30 |*                          ************* ***** **** ** ** **          |
     |*                                       *     *    *  *  *  *         |
     |*                                                           *         |
     |                                                             *        |
  20 |                                                             *        |
     |                                                             *        |
     |                                                              *       |
  10 |                                                              *       |
     |                                                              *       |
     |                                                               *      |
   0 +----------------------------------------------------------------------+
     0        10       20       30       40      50       60       70       80

                         Concurrent requests per second
  10 +----------------------------------------------------------------------+
     |*                                                                     |
     |*                                                                     |
     | *                                                                    |
   8 | *                                                                    |
     | *                                                                    |
     | **********                                                           |
     |           *                                                          |
   6 |           ****                                                       |
     |               *********                                              |
     |                        *                                             |
   4 |                        ****                                          |
     |                           *                                          |
     |                           **************                             |
     |                                         *                            |
   2 |                                         ********************         |
     |                                                             *        |
     |                                                             **       |
     |                                                               *      |
   0 +----------------------------------------------------------------------+
     0        10       20       30       40      50       60       70       80
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  71.21
Total input tokens:                      1374
Total generated tokens:                  2534
Request throughput (req/s):              0.14
Output token throughput (tok/s):         35.58
Peak output token throughput (tok/s):    56.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          54.88
---------------Time to First Token----------------
Mean TTFT (ms):                          1049.56
Median TTFT (ms):                        1143.46
P99 TTFT (ms):                           1145.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          121.63
Median TPOT (ms):                        130.11
P99 TPOT (ms):                           145.40
---------------Inter-token Latency----------------
Mean ITL (ms):                           106.38
Median ITL (ms):                         99.50
P99 ITL (ms):                            154.85
==================================================

QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ

vllm serve QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 32768 --load-format fastsafetensors
vllm bench serve   --backend vllm   --model QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ   --endpoint /v1/completions   --dataset-name sharegpt   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json   --num-prompts 1   --port 8888 --host spark
                            Output tokens per second
  30 +----------------------------------------------------------------------+
     |                                                                      |
     |                                                                      |
  25 |          ***********************************                         |
     |   *******                                   ****                     |
     |***                                              *****                |
  20 |                                                      ***             |
     |                                                         *            |
     |                                                          *           |
  15 |                                                           *          |
     |                                                            *         |
     |                                                             *        |
     |                                                              *       |
  10 |                                                               **     |
     |                                                                 *    |
     |                                                                  *   |
   5 |                                                                   *  |
     |                                                                    * |
     |                                                                     *|
   0 +----------------------------------------------------------------------+
     0             1             2              3             4             5

                          Concurrent requests per second
    1 +---------------------------------------------------------------------+
      |                                                        *            |
      |                                                        *            |
      |                                                         *           |
  0.8 |                                                          *          |
      |                                                           *         |
      |                                                           *         |
      |                                                            *        |
  0.6 |                                                             *       |
      |                                                              *      |
      |                                                              *      |
  0.4 |                                                               *     |
      |                                                                *    |
      |                                                                 *   |
      |                                                                 *   |
  0.2 |                                                                  *  |
      |                                                                   * |
      |                                                                    *|
      |                                                                    *|
    0 +---------------------------------------------------------------------+
      0             1             2             3             4             5
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  4.77
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.21
Output token throughput (tok/s):         24.93
Peak output token throughput (tok/s):    26.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          27.44
---------------Time to First Token----------------
Mean TTFT (ms):                          170.23
Median TTFT (ms):                        170.23
P99 TTFT (ms):                           170.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.01
Median TPOT (ms):                        39.01
P99 TPOT (ms):                           39.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.01
Median ITL (ms):                         38.69
P99 ITL (ms):                            41.63
==================================================
                            Output tokens per second
  70 +----------------------------------------------------------------------+
     |                                                                      |
     | ****  *                                                              |
  60 | *   * **                                                             |
     |*    ** * * *                                                         |
  50 |*        * ***         **                                             |
     |*          * *         * *                                            |
     |*             * *** ***   *                                           |
  40 |*              *   *      *   * * ** *                                |
     |*                          *** * *  * ******* ******** ** *           |
     |*                                            *        *  * *          |
  30 |*                                                          *          |
     |                                                           *          |
     |                                                           *          |
  20 |                                                           *          |
     |                                                            *         |
  10 |                                                            *         |
     |                                                            *         |
     |                                                            *         |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70

                         Concurrent requests per second
  10 +----------------------------------------------------------------------+
     |*                                                                     |
     |*                                                                     |
     |*                                                                     |
   8 | *                                                                    |
     | *                                                                    |
     | *********                                                            |
     |          *                                                           |
   6 |          ****                                                        |
     |              *********                                               |
     |                       *                                              |
   4 |                       ***                                            |
     |                          *                                           |
     |                          **************                              |
     |                                        *                             |
   2 |                                        ********************          |
     |                                                            *         |
     |                                                            *         |
     |                                                             *        |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  60.18
Total input tokens:                      1374
Total generated tokens:                  2534
Request throughput (req/s):              0.17
Output token throughput (tok/s):         42.11
Peak output token throughput (tok/s):    63.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          64.94
---------------Time to First Token----------------
Mean TTFT (ms):                          1009.22
Median TTFT (ms):                        1319.45
P99 TTFT (ms):                           1321.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          102.65
Median TPOT (ms):                        109.34
P99 TPOT (ms):                           121.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           90.07
Median ITL (ms):                         82.15
P99 ITL (ms):                            123.82
==================================================

Interesting, it’s the opposite of what https://medium.com/data-science-collective/nvfp4-same-accuracy-with-2-3x-higher-throughput-for-4-bit-llms-03518ecba108 observed using the RTX Pro 6000.

Also, it's interesting that he was comparing them using Llama 3.3 70B, and that specific model is not present in his collection.

This is interesting. I’ve seen some other posts using RTX6000, and their findings were similar to mine.

On the other hand, even though both are Blackwell, GB10 and RTX6000 are slightly different architectures - the RTX6000 is sm120 and the GB10 is sm121… It looks like the FP4 kernels are not optimized for sm121 yet.
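
For anyone checking their own setup: the sm code is just the CUDA compute capability with the dot dropped, and a library whose kernel list only mentions sm120 will skip its optimized path on GB10 even though the silicon is nearly identical. A minimal sketch (the helper and the example supported-set are mine, not from any particular library):

```python
def sm_code(major: int, minor: int) -> str:
    """Render a CUDA compute capability tuple as an 'smXY' arch code."""
    return f"sm{major}{minor}"

# Documented compute capabilities for these Blackwell parts:
#   RTX 5090 / RTX PRO 6000 -> (12, 0), GB10 (DGX Spark) -> (12, 1)
assert sm_code(12, 0) == "sm120"
assert sm_code(12, 1) == "sm121"

# Hypothetical supported-arch set for some library build; sm121 is
# simply absent, so GB10 falls through to an unoptimized fallback.
supported = {"sm90", "sm100", "sm120"}
print(sm_code(12, 1) in supported)  # prints False
```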

Within vLLM, there are separate optimizations and kernels for Mixture-of-Experts (MoE) models like Qwen3-VL-235B-A22B and dense models like Llama 3.3 70B. So, it would not be surprising to see a difference in performance across quantization formats when comparing MoE vs. dense models, especially for newer formats like NVFP4.

I point this out because it may explain why one tester gets a different result using dense models than seen here using MoE models. They are not both equally optimized in vLLM and other lower level layers yet for NVFP4.

Some relevant discussion regarding vLLM specifically missing NVFP4 support:

Additionally:

I saw your post in the other reply.
This suggests it’s hitting a scenario where the weights aren’t being quantized, which doesn’t seem right

@eugr are you aware of any work in vLLM to optimise NVFP4? Ought it to be faster on Blackwell?

According to NVIDIA folks here, the work is underway, but I haven't seen any relevant pull requests in the vLLM GitHub yet.

You should get a share of every spark sold, because you are adding tremendous value to the community. Thanks!


@eugr I've got a question for you.

So, since the Spark released, it's become apparent that many of the frameworks and libraries didn't support ARM, and there's a period of catch-up happening.

My question is, at what level is the incompatibility?

Is it the libraries themselves (vLLM, PaddlePaddle, etc.), a framework they all depend on, or both?

I think ARM support has been good for a while. There is nothing special about the Spark's ARM processor - it's a standard aarch64 architecture that has been around long enough that all major packages/libraries build on it just fine.

It's more about the GPU-specific support. It is Blackwell, but it's a consumer-level Blackwell, and unlike the 5090 and RTX 6000 Pro, which share sm120, it has its own arch code - sm121. So, many libraries (including mainline PyTorch) don't know about sm121 yet. It will come, but it will take some time. Hopefully less than it took for sm120.


sm120 and sm121 are essentially the same; it's more a lack of sm12x support in general, but we are working with frameworks on that


True, but it looks like some libraries treat sm121 as different from sm120 and don't include the Blackwell-specific code in their builds. Even PyTorch complains about it (not sure if it affects anything, though).

Gene –

I just saw this. Can you comment a bit further on the accuracy of NVFP4 vs. AWQ? Some people would no doubt prefer greater accuracy to faster speed.

It depends. If the model was post-trained in FP4 (similar to gpt-oss and some Nemotron models), then NVFP4 will be a "native" quant and will have better accuracy than a bf16 model quantized to 4-bit AWQ.

On the other hand, if they are both quantized from bf16, it depends on whether w4a4 or w4a16 was used. AWQ quants are normally w4a16: 4-bit weights with 16-bit activations. Most NVFP4 quants in the wild that I've seen are w4a4, so in theory AWQ should have better accuracy. NVFP4 w4a16 should be more accurate than AWQ, though.
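
To make the precision gap concrete: NVFP4 stores values in E2M1 format (1 sign, 2 exponent, 1 mantissa bit), which can represent only eight magnitudes per sign before block scaling. A rough round-to-nearest sketch (illustrative only; it ignores the per-block FP8 scale factors that real NVFP4 applies):

```python
# The only magnitudes representable in E2M1 before block scaling.
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest representable E2M1 value (no block scale)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(E2M1_LEVELS, key=lambda level: abs(abs(x) - level))
    return sign * mag

for x in [0.7, 1.2, 2.4, -3.6]:
    print(x, "->", quantize_e2m1(x))
```

With w4a16, only the weights go through this kind of rounding; with w4a4, the activations do too, which is the intuition behind expecting AWQ (w4a16) to edge out typical w4a4 NVFP4 quants of the same bf16 checkpoint.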

Also, not every quantization process is the same - a lot depends on the calibration dataset, etc.


Folks, is there any update on the state of support for this?

It would be nice to run the native NVFP4 of Nemotron Nano. It still seems like NVFP4 support is lacking.

PRs I’m watching:

[Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths by blake-snc Β· Pull Request #35568 Β· vllm-project/vllm

[Bugfix] Fix uninitialized NVFP4 global scale causing inf overflow by lucaspirola Β· Pull Request #35693 Β· vllm-project/vllm

Enable sm120f compilation by kahyunnam Β· Pull Request #2650 Β· flashinfer-ai/flashinfer

[CuTeDSL] Flash Attention v2 for SM120 (Blackwell GeForce) by blake-snc Β· Pull Request #3030 Β· NVIDIA/cutlass

feat: add CuTe DSL flash attention backend for SM120 GPUs by blake-snc Β· Pull Request #2598 Β· flashinfer-ai/flashinfer

fix: guard CUTLASS FMHA against SM12x and fix fmha_v2 SM121a check by blake-snc Β· Pull Request #2560 Β· flashinfer-ai/flashinfer

Support NVFP4 KV cache decode on SM120 by Tom-Zheng Β· Pull Request #2520 Β· flashinfer-ai/flashinfer

It's somewhat comical watching people scramble to determine hardware capability across sm120, sm121, and sm100 and their respective "a" and "f" variant build flags.

To their credit, it's a total Rube Goldberg machine on top of an already incredibly delicate/nuanced process.
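
For anyone trying to follow along: the "a" suffix marks architecture-specific features that are not forward compatible, while "f" marks family-conditional features that carry within a family, and build systems have to pick which variants each kernel is compiled for. A toy sketch of the gating involved (the kernel name and target table are hypothetical, only the suffix semantics follow NVCC's documentation):

```python
# NVCC accepts targets like sm_121, sm_121a (arch-specific features),
# and sm_121f (family-conditional features). Build systems decide which
# variants each kernel gets compiled for.
KERNEL_TARGETS = {
    # hypothetical: an attention kernel compiled only for these targets
    "fmha_v2": {"sm100a", "sm120a"},
}

def kernel_available(kernel: str, major: int, minor: int) -> bool:
    """Crude check: does the kernel's target list cover this exact arch?

    'a' variants are NOT forward compatible, so sm120a code does not run
    on sm121 even though the chips are nearly identical.
    """
    return f"sm{major}{minor}a" in KERNEL_TARGETS[kernel]

print(kernel_available("fmha_v2", 12, 0))  # True on RTX PRO 6000 (sm120)
print(kernel_available("fmha_v2", 12, 1))  # False on GB10 (sm121)
```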


@johnny_nv @eugr -

I like this blake dude - fix: Software E2M1 conversion for SM12x NVFP4 activation quantization by blake-snc Β· Pull Request #35947 Β· vllm-project/vllm Β· GitHub

I merged his PR into main vllm this morning at this commit [Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models (#34… Β· vllm-project/vllm@d6e04f4 Β· GitHub

And updated flashinfer 0.6.5 with the 120f fix Enable sm120f compilation (#2650) Β· flashinfer-ai/flashinfer@635505f Β· GitHub

Still getting the autotuner debug output on launch but the crashing/illegal instruction errors are gone. Going to try some speed/accuracy tests a little later but so far this looks really promising.

Edit: 20 minutes after posting this, it crashed. Never mind. Wish I had a definitive reproduction, but it only shows up after some time/benchmarking :(
