Qwen3-Next AWQ 4-bit vs FP8 vs NVFP4 on a single Spark

As personal research, I ran several of cyankiwi’s AWQ 4-bit models, NVIDIA’s NVFP4 models, and Qwen’s own official FP8 models. All results are based on a single-node setup.

I’m not sure this forum needs such a rudimentary report, and I feel a bit embarrassed to share it, but I thought it might help people who want to use the DGX Spark right away in its out-of-the-box state, or who are planning a trial similar to mine.


Qwen3-Next AWQ 4-bit vs FP8 vs NVFP4

This is a comparison of AWQ 4-bit (cyankiwi) vs. FP8 (Qwen) vs. NVFP4 (NVIDIA), all based on Qwen3-Next-80B-A3B-Instruct.

For context, the original purpose was to benchmark out-of-the-box performance for Spark newcomers right after unboxing, so I used the NVCR vLLM image with minimal option changes and followed the recipe provided in each model card as-is.

If you think there might be meaningful differences with specific vLLM versions or configurations, feel free to share your thoughts. I’ll run those when I get the time and update the results accordingly.


Conclusion

@tbraun96’s vLLM x NVFP4 combination is notably more stable in both operation and quality than AWQ 4-bit. In fact, other NVFP4 models, such as Nemotron Nano, were unstable and prone to crashes, so the stable operation observed here is significant in its own right.

In qualitative testing, NVFP4 showed no meaningful difference from FP8. I plan to run a long-term stability test and, if it holds up, replace the current FP8 deployment with this NVFP4 build, since it shows better performance at high concurrency (c > 2).


Recipes used for AWQ & FP8
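
For anyone unfamiliar with the NVCR image, here is a minimal sketch of this kind of launch. The image tag, port, and context length are illustrative assumptions, not the exact recipe values; the actual runs simply followed each model card’s recipe as-is.

# Image tag is an assumption; check the NGC catalog for the current one.
docker run --rm --gpus all --ipc=host --network host \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    nvcr.io/nvidia/vllm:25.09-py3 \
    vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
        --host 0.0.0.0 --port 8000 \
        --max-model-len 65536 \
        --gpu-memory-utilization 0.90
# For the AWQ run, swap in cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit.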


llama-benchy results


AWQ 4bit

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 | 3602.77 ± 726.67 | | 601.33 ± 140.02 | 596.68 ± 140.02 | 601.44 ± 140.03 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 | 32.82 ± 0.03 | 33.67 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d4096 | 4141.60 ± 54.07 | | 1488.38 ± 19.21 | 1483.73 ± 19.21 | 1488.45 ± 19.21 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d4096 | 32.56 ± 0.18 | 34.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d8192 | 4041.35 ± 96.43 | | 2539.88 ± 59.64 | 2535.23 ± 59.64 | 2539.96 ± 59.64 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d8192 | 32.50 ± 0.43 | 34.67 ± 1.25 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d16384 | 3992.94 ± 11.33 | | 4620.82 ± 13.09 | 4616.18 ± 13.09 | 4620.91 ± 13.09 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d16384 | 31.01 ± 0.03 | 32.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d32768 | 3650.88 ± 13.44 | | 9541.11 ± 35.20 | 9536.46 ± 35.20 | 9541.22 ± 35.19 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d32768 | 29.57 ± 0.22 | 30.67 ± 0.94 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d65535 | 3134.18 ± 8.21 | | 21567.88 ± 56.68 | 21563.24 ± 56.68 | 21567.99 ± 56.69 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d65535 | 26.93 ± 0.20 | 28.33 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d100000 | 2727.70 ± 5.06 | | 37416.32 ± 69.37 | 37411.68 ± 69.37 | 37416.43 ± 69.36 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d100000 | 24.58 ± 0.31 | 26.00 ± 1.41 | | | |

FP8

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 | 3472.79 ± 355.85 | | 601.17 ± 59.20 | 595.78 ± 59.20 | 601.27 ± 59.19 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 | 44.56 ± 0.67 | 46.67 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 | 3770.69 ± 64.86 | | 1635.29 ± 28.20 | 1629.90 ± 28.20 | 1635.40 ± 28.20 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 | 43.41 ± 0.48 | 45.33 ± 1.70 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d8192 | 3795.74 ± 72.02 | | 2704.13 ± 51.73 | 2698.74 ± 51.73 | 2704.24 ± 51.73 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d8192 | 42.57 ± 0.55 | 44.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d16384 | 3655.91 ± 39.05 | | 5047.67 ± 54.22 | 5042.28 ± 54.22 | 5047.77 ± 54.23 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d16384 | 40.60 ± 0.44 | 41.67 ± 0.94 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d32768 | 3383.68 ± 20.22 | | 10295.16 ± 61.28 | 10289.77 ± 61.28 | 10295.28 ± 61.27 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d32768 | 38.50 ± 0.57 | 40.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d65535 | 2922.68 ± 5.59 | | 23129.08 ± 44.29 | 23123.69 ± 44.29 | 23129.17 ± 44.28 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d65535 | 33.73 ± 0.10 | 34.67 ± 0.47 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d100000 | 2514.38 ± 3.94 | | 40590.94 ± 63.52 | 40585.55 ± 63.52 | 40591.06 ± 63.50 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d100000 | 30.14 ± 0.11 | 31.00 ± 0.00 | | | |

NVFP4

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 4758.90 ± 7.31 | | 497.65 ± 0.66 | 430.35 ± 0.66 | 497.72 ± 0.66 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 | 39.54 ± 0.03 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 | 4262.97 ± 11.34 | | 1508.55 ± 3.83 | 1441.26 ± 3.83 | 1508.61 ± 3.84 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 | 39.02 ± 0.02 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d8192 | 4102.79 ± 20.83 | | 2563.22 ± 12.63 | 2495.93 ± 12.63 | 2563.28 ± 12.63 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d8192 | 38.69 ± 0.01 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d16384 | 3802.17 ± 5.30 | | 4915.06 ± 6.76 | 4847.76 ± 6.76 | 4915.12 ± 6.76 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d16384 | 37.92 ± 0.02 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d32768 | 3379.46 ± 12.13 | | 10369.66 ± 36.88 | 10302.37 ± 36.88 | 10369.72 ± 36.88 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d32768 | 36.78 ± 0.04 | 37.67 ± 0.47 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d65535 | 2645.75 ± 2.06 | | 25611.32 ± 19.92 | 25544.03 ± 19.92 | 25611.39 ± 19.93 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d65535 | 34.59 ± 0.01 | 36.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d100000 | 2251.31 ± 42.68 | | 45411.66 ± 848.74 | 45344.37 ± 848.74 | 45411.73 ± 848.74 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d100000 | 32.57 ± 0.05 | 34.00 ± 0.00 | | | |

Concurrency test
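
For a quick sanity check of concurrent throughput without a benchmarking tool, a minimal sketch that fires C parallel requests at vLLM’s OpenAI-compatible endpoint can be used; the port, model name, and prompt below are assumptions:

C=4
time (
    for i in $(seq 1 "$C"); do
        # each request runs in the background so all C overlap
        curl -s http://localhost:8000/v1/completions \
            -H "Content-Type: application/json" \
            -d '{"model": "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8", "prompt": "Write a long essay about benchmarking.", "max_tokens": 128}' \
            > /dev/null &
    done
    wait    # total wall time approximates the batched completion latency
)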


FP8

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|------------:|----------:|---------:|---------------:|----------:|-------------:|--------------:|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c1) | 43.74 ± 0.00 | 43.74 ± 0.00 | 44.00 ± 0.00 | 44.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c2) | 3799.01 ± 5.44 | | | | 804.97 ± 273.21 | 0.00 ± 0.00 | 805.00 ± 273.20 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c2) | 60.92 ± 0.66 | 32.92 ± 2.17 | 72.00 ± 0.00 | 36.25 ± 0.43 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c4) | 3743.13 ± 46.92 | | | | 1612.41 ± 658.43 | 0.00 ± 0.00 | 1612.45 ± 658.42 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c4) | 77.59 ± 1.33 | 23.67 ± 2.50 | 108.00 ± 4.00 | 27.00 ± 1.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c8) | 3771.81 ± 30.35 | 11125.29 ± 18460.22 | | | 2774.71 ± 1248.02 | 643.22 ± 716.90 | 2774.73 ± 1248.00 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c8) | 95.50 ± 0.16 | 15.66 ± 2.09 | 159.50 ± 0.50 | 20.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c1) | 43.02 ± 0.02 | 43.02 ± 0.02 | 44.00 ± 0.00 | 44.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c2) | 3808.13 ± 50.05 | 9409.89 ± 608.78 | | | 2412.16 ± 815.83 | 327.84 ± 329.21 | 2412.20 ± 815.79 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c2) | 47.73 ± 0.34 | 29.34 ± 5.06 | 70.00 ± 0.00 | 35.50 ± 0.50 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c4) | 3788.62 ± 22.88 | 3102.41 ± 1705.26 | | | 4268.68 ± 1830.10 | 1938.99 ± 1495.08 | 4268.71 ± 1830.07 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c4) | 52.05 ± 0.51 | 19.39 ± 4.72 | 108.00 ± 0.00 | 27.25 ± 0.43 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c8) | 3721.93 ± 8.59 | 1706.27 ± 1533.06 | | | 7682.40 ± 3767.43 | 5229.76 ± 3584.63 | 7682.42 ± 3767.42 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c8) | 54.17 ± 0.80 | 11.28 ± 3.45 | 144.00 ± 8.00 | 19.00 ± 1.80 | | | |

NVFP4

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|------------:|----------:|---------:|---------------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c1) | 39.53 ± 0.01 | 39.53 ± 0.01 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c2) | 3999.16 ± 33.26 | | | | 954.34 ± 108.56 | 0.00 ± 0.00 | 954.37 ± 108.55 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c2) | 73.08 ± 2.54 | 37.55 ± 1.04 | 78.00 ± 0.00 | 39.25 ± 0.43 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c4) | 4027.48 ± 1.36 | | | | 1639.69 ± 512.64 | 0.00 ± 0.00 | 1639.75 ± 512.61 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c4) | 92.36 ± 0.78 | 27.97 ± 2.60 | 123.50 ± 0.50 | 30.88 ± 0.33 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c8) | 4061.79 ± 13.11 | 5988.49 ± 9483.77 | | | 2801.77 ± 1085.85 | 867.94 ± 761.08 | 2801.81 ± 1085.83 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c8) | 117.15 ± 1.57 | 19.90 ± 2.89 | 196.00 ± 4.00 | 24.56 ± 0.50 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c1) | 39.00 ± 0.05 | 39.00 ± 0.05 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c2) | 4017.59 ± 23.06 | 6827.70 ± 134.02 | | | 2424.31 ± 634.37 | 450.11 ± 450.28 | 2424.33 ± 634.36 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c2) | 54.42 ± 0.03 | 32.59 ± 4.77 | 76.00 ± 0.00 | 38.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c4) | 4016.49 ± 1.27 | 2424.85 ± 960.92 | | | 4251.89 ± 1640.57 | 2181.83 ± 1510.95 | 4251.94 ± 1640.55 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c4) | 59.06 ± 0.60 | 22.45 ± 5.45 | 120.00 ± 4.00 | 30.50 ± 1.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c8) | 4028.59 ± 18.27 | 1462.30 ± 1060.95 | | | 7396.25 ± 3437.88 | 5283.01 ± 3365.77 | 7396.31 ± 3437.87 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c8) | 64.17 ± 0.11 | 14.39 ± 4.99 | 192.00 ± 0.00 | 24.62 ± 1.11 | | | |

Effect of Speculative Decoding (SD) on token generation

| Quant | No SD | SD = 2 |
|:------|------:|-------:|
| AWQ 4bit | <33.2 tokens/s | <59.9 tokens/s |
| FP8 | <44.1 tokens/s | <62.1 tokens/s |
| NVFP4 | <39.5 tokens/s | <64.8 tokens/s |
  • This is a record of the average token generation rate from the vLLM logs when asking for the longest possible response on a specific topic in the WebUI. Since differences from speculative decoding were not detectable in llama-benchy, this was verified manually.

  • For reference, eugr’s latest vLLM build reaches <90 tokens/s with gpt-oss 20b and <60 tokens/s with gpt-oss 120b, without EAGLE3.

  • Note 1: For AWQ 4-bit and FP8, using SD causes intermittent model crashes. The exact timing and conditions are difficult to pinpoint at this stage, but crashes were observed in both models after more than a week of use.

  • Note 2: "num_speculative_tokens": 2 appears to be the optimal setting for now; increasing it tends to lower the acceptance rate, which in turn reduces throughput. (A launch sketch follows below.)
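
For reference, this is roughly how MTP speculative decoding is enabled in vLLM for this model. The JSON shape follows the Qwen3-Next model card recipe, but treat the exact method name as an assumption for your particular vLLM build:

# Sketch: serve with 2 speculative tokens per step (the setting above).
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'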


Korean ability test

| Quant | Language mixing | Jajang-1 | Jajang-2 |
|:------|:----------------|:---------|:---------|
| AWQ 4bit | Exists | Fail | Pass |
| FP8 | Rarely | Pass | Pass |
| NVFP4 | None | Pass | Pass |
  • This is not an official test, but one of the qualitative model-evaluation methods used in Korean LLM communities. It may not be relevant to English- or Chinese-speaking users.

  • Smaller models and more heavily quantized models tend to exhibit more severe language mixing and degraded cultural comprehension, and this test is designed to observe those phenomena. I’ve seen user reviews mentioning that tool use behaves strangely in quantized models (especially MoE models), and I suspect this is a similar type of phenomenon.

  • Language mixing: When a question such as “Explain Maxwell’s equations in as much detail and length as possible” is asked in Korean, hiragana, katakana, Chinese characters, Arabic script, etc. may appear intermittently or frequently in the response. This also occurs in the latest Sonnet/Haiku, Gemini Flash, gpt-5.3-codex, and similar models, though with varying frequency. (A reproduction sketch follows after the lyric below.)

  • Jajang: This test assesses Korean language comprehension, specifically whether the model understands pragmatic expressions and implicit Korean cultural context beyond the surface-level content. Based on the song lyrics below, Jajang-1 asks the model to determine whether the narrator is male or female, and Jajang-2 asks why the mother said she doesn’t like jajangmyeon. The correct answer to Jajang-1 is “male”; the correct answer to Jajang-2 is that it was a white lie born of selfless sacrifice: the mother, despite difficult financial circumstances, ordered an expensive dish with her secret savings and gave it all to her child, lying so the child would not feel guilty or sorry. For reference, gpt-oss 20b passes Jajang-1 but fails Jajang-2; 120b passes both.

  • Original lyric: 어려서부터 우리 집은 가난했었고 남들 다하는 외식 몇 번 한 적이 없었고 일터에 나가신 어머니 집에 없으면 언제나 혼자서 끓여 먹었던 라면 그러다 라면이 너무 지겨워서 맛있는 것 좀 먹자고 대들었었어 그러자 어머님이 마지못해 꺼내신 숨겨두신 비상금으로 시켜주신 짜장면 하나에 너무나 행복했었어 하지만 어머님은 왠지 드시질 않았어 어머님은 짜장면이 싫다고 하셨어 어머님은 짜장면이 싫다고 하셨어 야이야~야 그렇게 살아가고 그렇게 후회하고 눈물도 흘리고 야이야~야 그렇게 살아가고 너무나 아프고 하지만 다시 웃고 (Translation: “Ever since I was young, our family was poor, and we never once ate out the way other families did. Whenever my mother was out working and not at home, I always boiled ramyeon for myself. Then one day I got so sick of ramyeon that I demanded we eat something tasty for once, so my mother reluctantly took out the emergency money she had hidden away and ordered me a jajangmyeon, and that single bowl made me so happy. But for some reason my mother didn’t eat. My mother said she didn’t like jajangmyeon. My mother said she didn’t like jajangmyeon. Ya-i-ya, living like that, regretting like that, shedding tears; ya-i-ya, living like that, hurting so much, but smiling again.”)
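
For anyone who wants to reproduce the language-mixing probe, here is a minimal sketch against the OpenAI-compatible endpoint; the port, model name, and token budget are assumptions:

# Ask for a long Korean answer, then scan the output for stray
# hiragana/katakana/Hanzi/Arabic script mixed into the Hangul.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4",
         "messages": [{"role": "user", "content": "맥스웰 방정식을 최대한 자세하고 길게 설명해 주세요"}],
         "max_tokens": 2048}'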


Appendix: Korean interaction results from other models

| Model (Full Name) | Description |
|:------------------|:------------|
| cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit | Generates incoherent/gibberish sentences. |
| cyankiwi/Magistral-Small-2507-AWQ-4bit | Generates incoherent/gibberish sentences. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Severe language mixing; crashes after a single response. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | Slight language mixing; notably, maintains around 40 tokens/s even in long contexts. |
| cyankiwi/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit & Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Slight language mixing. Both show similar stability, but FP8 is naturally slower than AWQ 4-bit, so AWQ 4-bit seems preferable. |
| cyankiwi/GLM-4.5-Air-AWQ-4bit | No language mixing, but too slow. |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit (with eugr’s vllm) | No language mixing. Maintains around 41 tokens/s, but the thinking process is too long. |

Some tested models may be missing from the list, likely because they didn’t function properly (e.g., Baidu Ernie).


BTW, the FlashInfer implementation lets you fit more context (actually 2x more compared to the FLASH_ATTN one); that’s for the FP8 model.

docker run --rm --name dgx-vllm-nvfp4 \
    --network host --gpus all --ipc=host \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    -v /home/edison/Downloads/vllm/models:/models \
    -v $(pwd)/fix_flashinfer_e2m1_sm121.py:/tmp/fix1.py \
    -v $(pwd)/fix_flashinfer_nvfp4_moe_backend.py:/tmp/fix2.py \
    -v $(pwd)/fix_capability_121_v112.py:/tmp/fix3.py \
    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -e MODEL=/models/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    -e PORT=8888 \
    -e GPU_MEMORY_UTIL=0.70 \
    -e MAX_MODEL_LEN=65536 \
    -e MAX_NUM_SEQS=128 \
    -e VLLM_EXTRA_ARGS="--attention-backend flashinfer --kv-cache-dtype fp8" \
    --entrypoint bash \
    avarok/dgx-vllm-nvfp4-kernel:v22 \
    -c "python3 /tmp/fix1.py && python3 /tmp/fix2.py && python3 /tmp/fix3.py && \
        exec vllm serve \$MODEL --host 0.0.0.0 --port \$PORT \
        --max-model-len \$MAX_MODEL_LEN --gpu-memory-utilization \$GPU_MEMORY_UTIL \
        --max-num-seqs \$MAX_NUM_SEQS \$VLLM_EXTRA_ARGS"


| model                                     |   test |             t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:------------------------------------------|-------:|----------------:|-------------:|----------------:|----------------:|----------------:|
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  pp512 | 1587.78 ± 24.74 |              |   324.78 ± 4.95 |   322.54 ± 4.95 |   324.85 ± 4.96 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    28.15 ± 0.05 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  pp512 | 1561.12 ± 23.78 |              |   330.29 ± 5.06 |   328.05 ± 5.06 |   330.35 ± 5.06 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    28.14 ± 0.05 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 |  2056.84 ± 9.92 |              |   997.96 ± 4.79 |   995.73 ± 4.79 |   998.03 ± 4.80 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    28.00 ± 0.08 | 28.40 ± 0.49 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 2050.19 ± 17.61 |              |  1001.25 ± 8.64 |   999.01 ± 8.64 |  1001.31 ± 8.64 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    27.95 ± 0.06 | 28.80 ± 0.40 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp8192 | 1924.07 ± 10.77 |              | 4260.02 ± 23.78 | 4257.78 ± 23.78 | 4260.09 ± 23.76 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    27.40 ± 0.05 | 28.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp8192 |  1928.50 ± 5.51 |              | 4250.14 ± 12.11 | 4247.90 ± 12.11 | 4250.22 ± 12.11 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    27.40 ± 0.05 | 28.00 ± 0.00 |                 |                 |                 |

Why did I get only 28 t/s without MTP (and much slower with MTP)?

Very interesting results! I wonder what makes NVFP4 not mix languages. This has been frustrating for me.

Hi,

I ran the image with the GitHub reference commands (the only change was MAX_MODEL_LEN=128000).

docker pull avarok/dgx-vllm-nvfp4-kernel:v22

docker run -d --name vllm-nvfp4 \
    --network host --gpus all --ipc=host \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -e MODEL=nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    -e PORT=8888 -e GPU_MEMORY_UTIL=0.90 \
    -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=128 \
    -e VLLM_EXTRA_ARGS="--attention-backend flashinfer --kv-cache-dtype fp8" \
    avarok/dgx-vllm-nvfp4-kernel:v22 serve
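
One quick check before benchmarking: vLLM’s OpenAI-compatible server exposes a model listing, so a probe like this (port as in the command above) confirms the container is actually serving:

curl -s http://localhost:8888/v1/models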

What I’ve found is that you applied some fixes via the fix*.py files. How about trying with the original instructions?

Also, llama-benchy shows no difference between MTP on/off modes. As mentioned, I checked the differences via the vLLM logs.

BTW, you don’t have to use the avarok image for that; Marlin NVFP4 works on a standard docker build as well, e.g.:

./launch-cluster.sh --solo \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --gpu-memory-utilization 0.7 \
  --host 0.0.0.0 --port 8888 \
  --max-model-len 128000 \
  --load-format fastsafetensors

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 3692.34 ± 512.38 | | 577.82 ± 93.12 | 567.51 ± 93.12 | 577.90 ± 93.12 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg32 | 43.28 ± 1.72 | 44.69 ± 1.78 | | | |

llama-benchy (0.3.1)
date: 2026-02-22 21:51:18 | latency mode: api


The NVFP4 version hasn’t been used long-term yet, so any issues may simply not have surfaced. FP8 also doesn’t frequently exhibit language mixing; as the word implies, it happens only “rarely” (IMO comparable to commercial-model levels). I expect language mixing could also show up with NVFP4 during extended use, and I’ll post an update if anything is found.

That said, I have commonly observed language mixing across FP4/AWQ 4-bit models, which suggests the custom vLLM build may have addressed whatever was causing the issue.

For reference, I also attempted to run Qwen3-Next NVFP4 with the NVCR vLLM version, but the absence of any records suggests it either failed to load or crashed after a single response. As I recall, TRT-LLM also failed, and it was through tbraun96’s vLLM that normal operation was first confirmed. In fact, none of the NVFP4 models worked with the playbook examples at all.

I suspect it was related to running the NVIDIA DGX Spark diagnostic program yesterday. I just unplugged the power adapter from the power strip, waited a few minutes, and then restarted the system for testing. It has basically returned to normal. (We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! - #91 by cho)