As a piece of personal research, I ran various cyankiwi AWQ 4-bit models, NVIDIA NVFP4 models, and Qwen’s own official FP8 models. All results are based on a single-node setup.
I’m not sure this forum needs such a rudimentary report, and I feel a bit embarrassed to share it, but I thought it might help people who want to use a DGX Spark right away in its out-of-the-box state, or who are planning a trial similar to mine.
Qwen3-Next AWQ 4-bit vs FP8 vs NVFP4
This is a comparison of AWQ 4-bit (cyankiwi) vs. FP8 (Qwen) vs. NVFP4 (NVIDIA) quantizations of Qwen3-Next-80B-A3B-Instruct.
For context, the original purpose was to benchmark out-of-the-box performance for Spark newcomers right after unboxing, so I used the NVCR vLLM image with minimal option changes and followed the recipe provided in each model card as-is.
If you think there might be meaningful differences with specific vLLM versions or configurations, feel free to share your thoughts. I’ll run those when I get the time and update the results accordingly.
Conclusion
@tbraun96’s vLLM x NVFP4 combination is notably more stable in operation and quality than AWQ 4-bit. In fact, other NVFP4 models (such as Nemotron Nano) were unstable and prone to crashes, so the stable operation observed here is significant in its own right.
In qualitative testing, NVFP4 showed no meaningful difference from FP8. I plan to run a long-term stability test and, if it holds up, replace my current FP8 deployment with this NVFP4 build, since it performs better at high concurrency (c > 2).
Used recipes for AWQ & FP8
- vLLM docker for AWQ 4-bit & FP8:
  `docker run --privileged --gpus all --net=host --ipc=host --name vllm_nvcr -it --rm -v ~/.cache/huggingface:/root/.cache/huggingface nvcr.io/nvidia/vllm:26.01-py3`
- vLLM recipe (note the straight quotes; the forum tends to render them as curly quotes, which break the shell command):
  `vllm serve {model} --host 0.0.0.0 --port 8000 --gpu_memory_utilization 0.85 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'`
- model: cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit, Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
- NVFP4 recipe: GitHub - Avarok-Cybersecurity/dgx-vllm: A dedicated effort to make an optimized, bleeding-edge vLLM image using Docker to support DGX comprehensively. @tbraun96, thank you for your contribution.
- I could not find any t/s difference between FLASH_ATTN and FLASHINFER for AWQ 4-bit and FP8.
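One copy-paste pitfall with the vLLM recipe: forum rendering often converts the straight quotes in the `--speculative-config` JSON into curly quotes, which the shell passes through and vLLM’s JSON parser rejects. A minimal sketch that builds the argument programmatically instead (model name and flags taken from the recipe; adapt to your setup):

```python
import json
import shlex

# Speculative-decoding config from the recipe; building it with json.dumps
# guarantees valid, straight-quoted JSON regardless of how the forum renders it.
spec_config = {"method": "qwen3_next_mtp", "num_speculative_tokens": 2}

cmd = [
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--gpu_memory_utilization", "0.85",
    "--speculative-config", json.dumps(spec_config),
]

# shlex.join produces a shell-safe command line to paste into the container.
print(shlex.join(cmd))
```

The same construction works for the AWQ 4-bit model by swapping the model name.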
llama-benchy results
AWQ 4bit
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 | 3602.77 ± 726.67 | | 601.33 ± 140.02 | 596.68 ± 140.02 | 601.44 ± 140.03 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 | 32.82 ± 0.03 | 33.67 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d4096 | 4141.60 ± 54.07 | | 1488.38 ± 19.21 | 1483.73 ± 19.21 | 1488.45 ± 19.21 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d4096 | 32.56 ± 0.18 | 34.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d8192 | 4041.35 ± 96.43 | | 2539.88 ± 59.64 | 2535.23 ± 59.64 | 2539.96 ± 59.64 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d8192 | 32.50 ± 0.43 | 34.67 ± 1.25 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d16384 | 3992.94 ± 11.33 | | 4620.82 ± 13.09 | 4616.18 ± 13.09 | 4620.91 ± 13.09 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d16384 | 31.01 ± 0.03 | 32.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d32768 | 3650.88 ± 13.44 | | 9541.11 ± 35.20 | 9536.46 ± 35.20 | 9541.22 ± 35.19 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d32768 | 29.57 ± 0.22 | 30.67 ± 0.94 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d65535 | 3134.18 ± 8.21 | | 21567.88 ± 56.68 | 21563.24 ± 56.68 | 21567.99 ± 56.69 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d65535 | 26.93 ± 0.20 | 28.33 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d100000 | 2727.70 ± 5.06 | | 37416.32 ± 69.37 | 37411.68 ± 69.37 | 37416.43 ± 69.36 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d100000 | 24.58 ± 0.31 | 26.00 ± 1.41 | | | |
FP8
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 | 3472.79 ± 355.85 | | 601.17 ± 59.20 | 595.78 ± 59.20 | 601.27 ± 59.19 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 | 44.56 ± 0.67 | 46.67 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 | 3770.69 ± 64.86 | | 1635.29 ± 28.20 | 1629.90 ± 28.20 | 1635.40 ± 28.20 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 | 43.41 ± 0.48 | 45.33 ± 1.70 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d8192 | 3795.74 ± 72.02 | | 2704.13 ± 51.73 | 2698.74 ± 51.73 | 2704.24 ± 51.73 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d8192 | 42.57 ± 0.55 | 44.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d16384 | 3655.91 ± 39.05 | | 5047.67 ± 54.22 | 5042.28 ± 54.22 | 5047.77 ± 54.23 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d16384 | 40.60 ± 0.44 | 41.67 ± 0.94 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d32768 | 3383.68 ± 20.22 | | 10295.16 ± 61.28 | 10289.77 ± 61.28 | 10295.28 ± 61.27 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d32768 | 38.50 ± 0.57 | 40.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d65535 | 2922.68 ± 5.59 | | 23129.08 ± 44.29 | 23123.69 ± 44.29 | 23129.17 ± 44.28 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d65535 | 33.73 ± 0.10 | 34.67 ± 0.47 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d100000 | 2514.38 ± 3.94 | | 40590.94 ± 63.52 | 40585.55 ± 63.52 | 40591.06 ± 63.50 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d100000 | 30.14 ± 0.11 | 31.00 ± 0.00 | | | |
NVFP4
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 4758.90 ± 7.31 | | 497.65 ± 0.66 | 430.35 ± 0.66 | 497.72 ± 0.66 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 | 39.54 ± 0.03 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 | 4262.97 ± 11.34 | | 1508.55 ± 3.83 | 1441.26 ± 3.83 | 1508.61 ± 3.84 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 | 39.02 ± 0.02 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d8192 | 4102.79 ± 20.83 | | 2563.22 ± 12.63 | 2495.93 ± 12.63 | 2563.28 ± 12.63 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d8192 | 38.69 ± 0.01 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d16384 | 3802.17 ± 5.30 | | 4915.06 ± 6.76 | 4847.76 ± 6.76 | 4915.12 ± 6.76 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d16384 | 37.92 ± 0.02 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d32768 | 3379.46 ± 12.13 | | 10369.66 ± 36.88 | 10302.37 ± 36.88 | 10369.72 ± 36.88 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d32768 | 36.78 ± 0.04 | 37.67 ± 0.47 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d65535 | 2645.75 ± 2.06 | | 25611.32 ± 19.92 | 25544.03 ± 19.92 | 25611.39 ± 19.93 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d65535 | 34.59 ± 0.01 | 36.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d100000 | 2251.31 ± 42.68 | | 45411.66 ± 848.74 | 45344.37 ± 848.74 | 45411.73 ± 848.74 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d100000 | 32.57 ± 0.05 | 34.00 ± 0.00 | | | |
Concurrency test
FP8
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c1) | 43.74 ± 0.00 | 43.74 ± 0.00 | 44.00 ± 0.00 | 44.00 ± 0.00 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c2) | 3799.01 ± 5.44 | 804.97 ± 273.21 | 0.00 ± 0.00 | 805.00 ± 273.20 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c2) | 60.92 ± 0.66 | 32.92 ± 2.17 | 72.00 ± 0.00 | 36.25 ± 0.43 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c4) | 3743.13 ± 46.92 | 1612.41 ± 658.43 | 0.00 ± 0.00 | 1612.45 ± 658.42 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c4) | 77.59 ± 1.33 | 23.67 ± 2.50 | 108.00 ± 4.00 | 27.00 ± 1.00 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c8) | 3771.81 ± 30.35 | 11125.29 ± 18460.22 | 2774.71 ± 1248.02 | 643.22 ± 716.90 | 2774.73 ± 1248.00 | ||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c8) | 95.50 ± 0.16 | 15.66 ± 2.09 | 159.50 ± 0.50 | 20.00 ± 0.00 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c1) | 43.02 ± 0.02 | 43.02 ± 0.02 | 44.00 ± 0.00 | 44.00 ± 0.00 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c2) | 3808.13 ± 50.05 | 9409.89 ± 608.78 | 2412.16 ± 815.83 | 327.84 ± 329.21 | 2412.20 ± 815.79 | ||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c2) | 47.73 ± 0.34 | 29.34 ± 5.06 | 70.00 ± 0.00 | 35.50 ± 0.50 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c4) | 3788.62 ± 22.88 | 3102.41 ± 1705.26 | 4268.68 ± 1830.10 | 1938.99 ± 1495.08 | 4268.71 ± 1830.07 | ||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c4) | 52.05 ± 0.51 | 19.39 ± 4.72 | 108.00 ± 0.00 | 27.25 ± 0.43 | |||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c8) | 3721.93 ± 8.59 | 1706.27 ± 1533.06 | 7682.40 ± 3767.43 | 5229.76 ± 3584.63 | 7682.42 ± 3767.42 | ||
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c8) | 54.17 ± 0.80 | 11.28 ± 3.45 | 144.00 ± 8.00 | 19.00 ± 1.80 |
NVFP4
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|---|---|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c1) | 39.53 ± 0.01 | 39.53 ± 0.01 | 40.00 ± 0.00 | 40.00 ± 0.00 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c2) | 3999.16 ± 33.26 | 954.34 ± 108.56 | 0.00 ± 0.00 | 954.37 ± 108.55 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c2) | 73.08 ± 2.54 | 37.55 ± 1.04 | 78.00 ± 0.00 | 39.25 ± 0.43 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c4) | 4027.48 ± 1.36 | 1639.69 ± 512.64 | 0.00 ± 0.00 | 1639.75 ± 512.61 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c4) | 92.36 ± 0.78 | 27.97 ± 2.60 | 123.50 ± 0.50 | 30.88 ± 0.33 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c8) | 4061.79 ± 13.11 | 5988.49 ± 9483.77 | 2801.77 ± 1085.85 | 867.94 ± 761.08 | 2801.81 ± 1085.83 | ||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c8) | 117.15 ± 1.57 | 19.90 ± 2.89 | 196.00 ± 4.00 | 24.56 ± 0.50 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c1) | 39.00 ± 0.05 | 39.00 ± 0.05 | 40.00 ± 0.00 | 40.00 ± 0.00 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c2) | 4017.59 ± 23.06 | 6827.70 ± 134.02 | 2424.31 ± 634.37 | 450.11 ± 450.28 | 2424.33 ± 634.36 | ||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c2) | 54.42 ± 0.03 | 32.59 ± 4.77 | 76.00 ± 0.00 | 38.00 ± 0.00 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c4) | 4016.49 ± 1.27 | 2424.85 ± 960.92 | 4251.89 ± 1640.57 | 2181.83 ± 1510.95 | 4251.94 ± 1640.55 | ||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c4) | 59.06 ± 0.60 | 22.45 ± 5.45 | 120.00 ± 4.00 | 30.50 ± 1.00 | |||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c8) | 4028.59 ± 18.27 | 1462.30 ± 1060.95 | 7396.25 ± 3437.88 | 5283.01 ± 3365.77 | 7396.31 ± 3437.87 | ||
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c8) | 64.17 ± 0.11 | 14.39 ± 4.99 | 192.00 ± 0.00 | 24.62 ± 1.11 |
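For anyone who wants to sanity-check the concurrency numbers without llama-benchy, a rough probe can be improvised against the OpenAI-compatible endpoint vLLM exposes. This is a sketch, not llama-benchy’s actual methodology; the URL, model name, and prompt are assumptions matching the serve recipe above, and there is no warmup or context-depth control.

```python
import concurrent.futures
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # assumed vLLM endpoint
MODEL = "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4"

def one_request(prompt: str, max_tokens: int = 128) -> tuple[int, float]:
    """Fire one completion request; return (completion tokens, seconds taken)."""
    payload = json.dumps({"model": MODEL, "prompt": prompt,
                          "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        BASE_URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["usage"]["completion_tokens"], time.perf_counter() - start

def summarize(results: list[tuple[int, float]], wall: float) -> tuple[float, float]:
    """Aggregate t/s over the wall clock, plus mean per-request t/s."""
    total_tokens = sum(tok for tok, _ in results)
    per_req = [tok / sec for tok, sec in results]
    return total_tokens / wall, sum(per_req) / len(per_req)

def run(concurrency: int) -> tuple[float, float]:
    """Fire `concurrency` requests at once, as in the (cN) rows above."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        results = list(pool.map(one_request,
                                ["Write a long story."] * concurrency))
    return summarize(results, time.perf_counter() - start)
```

`run(8)` corresponds very roughly to a tg128 (c8) row: the first number is total throughput, the second the average per-request rate.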
Effect of Speculative Decoding (SD) on token generation
| Quant | No SD | SD = 2 |
|---|---|---|
| AWQ 4bit | <33.2 tokens/s | <59.9 tokens/s |
| FP8 | <44.1 tokens/s | <62.1 tokens/s |
| NVFP4 | <39.5 tokens/s | <64.8 tokens/s |
- This is a record of the average token-generation rate reported in the vLLM logs when asking for the longest possible response on a specific topic in a WebUI. Since differences from Speculative Decoding were not detectable in llama-benchy, this was verified manually.
- As a reference, the latest eugr vLLM build reaches <90 tokens/s with gpt-oss 20b and <60 tokens/s with gpt-oss 120b, both without EAGLE3.
- Note 1: For AWQ 4-bit and FP8, using SD causes intermittent model crashes. The exact timing and conditions are difficult to pinpoint at this stage, but crashes were observed in both models over more than a week of use.
- Note 2: `"num_speculative_tokens": 2` appears to be the optimal setting for now. Increasing this value tends to lower the acceptance rate, which in turn reduces throughput.
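The trade-off in the note about `num_speculative_tokens` can be sketched with the usual expected-accepted-length formula for speculative decoding: if each of the k draft tokens is accepted with probability α (independence assumed), a verification step emits on average 1 + α + … + α^k = (1 − α^(k+1)) / (1 − α) tokens. The α values below are illustrative assumptions, not measurements from these models; they only show how raising k while α drops can reduce throughput.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # The target model always emits one token per verification step; each of
    # the k draft tokens is accepted with probability alpha, so the expected
    # yield is the geometric partial sum 1 + alpha + ... + alpha**k.
    return sum(alpha**i for i in range(k + 1))

# k=2 with a healthy acceptance rate beats k=4 with a degraded one:
print(expected_tokens_per_step(0.8, 2))  # ~2.44 tokens per step
print(expected_tokens_per_step(0.6, 4))  # ~2.31 tokens per step
```

This ignores the extra drafting and verification cost per step, so the real optimum shifts even further toward small k.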
Korean ability test
| Quant | Language mixing | Jajang-1 | Jajang-2 |
|---|---|---|---|
| AWQ 4bit | Present | Fail | Pass |
| FP8 | Rarely | Pass | Pass |
| NVFP4 | None | Pass | Pass |
- This is not an official test, but one of the qualitative model evaluation methods used in Korean LLM communities. It may not be relevant to English- or Chinese-speaking users.
- Smaller models and more heavily quantized models tend to exhibit more language mixing and degraded cultural comprehension; this test is designed to observe those phenomena. I’ve seen user reviews mentioning that tool use behaves strangely in quantized models (especially MoE models), and I suspect this is a similar type of phenomenon.
- Language mixing: when a question such as “Explain Maxwell’s equations in as much detail and length as possible” is asked in Korean, hiragana, katakana, Chinese characters, Arabic script, etc. may appear intermittently or frequently in the response. This also occurs in the latest Sonnet/Haiku, Gemini Flash, gpt-5.3-codex, and similar models, though with varying frequency.
- Jajang: a test of Korean language comprehension: specifically, whether the model understands pragmatic expressions and implicit cultural context beyond the surface-level content. Given the song lyrics below, determining “whether the narrator is male or female” is Jajang-1, and determining “why the mother said she doesn’t like jajangmyeon” is Jajang-2. The correct answer to Jajang-1 is “male”; the correct answer to Jajang-2 is “a white lie born of selfless sacrifice: despite difficult financial circumstances, the mother ordered an expensive dish with her secret savings and gave it all to her child, lying so the child would not feel guilty or sorry.” As a reference, gpt-oss 20b passes Jajang-1 but fails Jajang-2; 120b passes both.
- Original lyric: 어려서부터 우리 집은 가난했었고 남들 다하는 외식 몇 번 한 적이 없었고 일터에 나가신 어머니 집에 없으면 언제나 혼자서 끓여 먹었던 라면 그러다 라면이 너무 지겨워서 맛있는 것 좀 먹자고 대들었었어 그러자 어머님이 마지못해 꺼내신 숨겨두신 비상금으로 시켜주신 짜장면 하나에 너무나 행복했었어 하지만 어머님은 왠지 드시질 않았어 어머님은 짜장면이 싫다고 하셨어 어머님은 짜장면이 싫다고 하셨어 야이야~야 그렇게 살아가고 그렇게 후회하고 눈물도 흘리고 야이야~야 그렇게 살아가고 너무나 아프고 하지만 다시 웃고
- (English translation: “Ever since I was young, our family was poor, and we never once ate out the way other families did. When my mother was away at work, I always boiled ramen and ate it alone. Then I got so sick of ramen that I snapped at her, demanding something tasty. So my mother reluctantly took out the emergency money she had hidden away and ordered a jajangmyeon, and I was so happy with that one bowl. But for some reason my mother wouldn’t eat. My mother said she didn’t like jajangmyeon. My mother said she didn’t like jajangmyeon. Ya-i-ya, living on like that, regretting like that, shedding tears; ya-i-ya, living on like that, hurting so much, yet smiling again.”)
Appendix: Korean interaction results from other models
| Model (Full Name) | Description |
|---|---|
| cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit | Generates incoherent/gibberish sentences. |
| cyankiwi/Magistral-Small-2507-AWQ-4bit | Generates incoherent/gibberish sentences. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Severe language mixing; crashes after a single response. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | Slight language mixing occurs; characterized by maintaining around 40 tokens/s even in long contexts. |
| cyankiwi/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit & Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Slight language mixing occurs. Both show similar levels of stability, but naturally, FP8 is slower than AWQ 4bit. AWQ 4bit seems preferable. |
| cyankiwi/GLM-4.5-Air-AWQ-4bit | No language mixing occurs, but it is too slow. |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit (with eugr’s vllm) | No language mixing occurs. Maintains around 41 tokens/s, but the thinking process is too long. |
Some models I tested are missing from this list because they didn’t function properly at all (e.g., Baidu Ernie).