Qwen3-Next AWQ 4-bit vs FP8 vs NVFP4 on a single Spark

As personal research, I ran several of cyankiwi’s AWQ 4-bit models, NVIDIA’s NVFP4 models, and Qwen’s own official FP8 models. All results are based on a single-node setup.

I’m not sure this forum needs such a rudimentary report, and I feel a bit embarrassed to share it, but I thought it might help people who want to use the DGX Spark right away in its out-of-the-box state, or who are planning a trial similar to mine.


Qwen3-Next AWQ 4-bit vs FP8 vs NVFP4

This is a comparison of AWQ 4-bit (cyankiwi) vs. FP8 (Qwen) vs. NVFP4 (NVIDIA), all based on Qwen3-Next-80B-A3B-Instruct.

For context, the original purpose was to benchmark out-of-the-box performance for Spark newcomers right after unboxing, so I used the NVCR vLLM image with minimal option changes and followed the recipe provided in each model card as-is.

If you think there might be meaningful differences with specific vLLM versions or configurations, feel free to share your thoughts. I’ll run those when I get the time and update the results accordingly.


Conclusion

@tbraun96’s vLLM x NVFP4 combination is notably more stable in both operation and quality than AWQ 4-bit. In fact, other NVFP4 models, such as Nemotron Nano, were unstable and prone to crashes, so the stable operation observed here is significant in its own right.

In qualitative testing, NVFP4 showed no meaningful difference from FP8. I plan to run a long-term stability test and, if it holds up, replace the current FP8 deployment with this NVFP4 build, since it shows better performance at high concurrency (c > 2).


Recipes used for AWQ & FP8
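
For anyone unfamiliar with the NVCR image, here is a minimal sketch of this kind of launch. The image tag, port, and context length are illustrative assumptions, not the exact recipe values; the actual runs simply followed each model card’s recipe as-is.

# Image tag is an assumption; check the NGC catalog for the current one.
docker run --rm --gpus all --ipc=host --network host \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    nvcr.io/nvidia/vllm:25.09-py3 \
    vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
        --host 0.0.0.0 --port 8000 \
        --max-model-len 65536 \
        --gpu-memory-utilization 0.90
# For the AWQ run, swap in cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit.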


llama-benchy results


AWQ 4bit

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 | 3602.77 ± 726.67 | | 601.33 ± 140.02 | 596.68 ± 140.02 | 601.44 ± 140.03 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 | 32.82 ± 0.03 | 33.67 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d4096 | 4141.60 ± 54.07 | | 1488.38 ± 19.21 | 1483.73 ± 19.21 | 1488.45 ± 19.21 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d4096 | 32.56 ± 0.18 | 34.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d8192 | 4041.35 ± 96.43 | | 2539.88 ± 59.64 | 2535.23 ± 59.64 | 2539.96 ± 59.64 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d8192 | 32.50 ± 0.43 | 34.67 ± 1.25 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d16384 | 3992.94 ± 11.33 | | 4620.82 ± 13.09 | 4616.18 ± 13.09 | 4620.91 ± 13.09 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d16384 | 31.01 ± 0.03 | 32.00 ± 0.00 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d32768 | 3650.88 ± 13.44 | | 9541.11 ± 35.20 | 9536.46 ± 35.20 | 9541.22 ± 35.19 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d32768 | 29.57 ± 0.22 | 30.67 ± 0.94 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d65535 | 3134.18 ± 8.21 | | 21567.88 ± 56.68 | 21563.24 ± 56.68 | 21567.99 ± 56.69 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d65535 | 26.93 ± 0.20 | 28.33 ± 0.47 | | | |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | pp2048 @ d100000 | 2727.70 ± 5.06 | | 37416.32 ± 69.37 | 37411.68 ± 69.37 | 37416.43 ± 69.36 |
| cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit | tg128 @ d100000 | 24.58 ± 0.31 | 26.00 ± 1.41 | | | |

FP8

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 | 3472.79 ± 355.85 | | 601.17 ± 59.20 | 595.78 ± 59.20 | 601.27 ± 59.19 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 | 44.56 ± 0.67 | 46.67 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 | 3770.69 ± 64.86 | | 1635.29 ± 28.20 | 1629.90 ± 28.20 | 1635.40 ± 28.20 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 | 43.41 ± 0.48 | 45.33 ± 1.70 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d8192 | 3795.74 ± 72.02 | | 2704.13 ± 51.73 | 2698.74 ± 51.73 | 2704.24 ± 51.73 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d8192 | 42.57 ± 0.55 | 44.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d16384 | 3655.91 ± 39.05 | | 5047.67 ± 54.22 | 5042.28 ± 54.22 | 5047.77 ± 54.23 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d16384 | 40.60 ± 0.44 | 41.67 ± 0.94 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d32768 | 3383.68 ± 20.22 | | 10295.16 ± 61.28 | 10289.77 ± 61.28 | 10295.28 ± 61.27 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d32768 | 38.50 ± 0.57 | 40.33 ± 1.89 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d65535 | 2922.68 ± 5.59 | | 23129.08 ± 44.29 | 23123.69 ± 44.29 | 23129.17 ± 44.28 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d65535 | 33.73 ± 0.10 | 34.67 ± 0.47 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d100000 | 2514.38 ± 3.94 | | 40590.94 ± 63.52 | 40585.55 ± 63.52 | 40591.06 ± 63.50 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d100000 | 30.14 ± 0.11 | 31.00 ± 0.00 | | | |

NVFP4

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 4758.90 ± 7.31 | | 497.65 ± 0.66 | 430.35 ± 0.66 | 497.72 ± 0.66 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 | 39.54 ± 0.03 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 | 4262.97 ± 11.34 | | 1508.55 ± 3.83 | 1441.26 ± 3.83 | 1508.61 ± 3.84 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 | 39.02 ± 0.02 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d8192 | 4102.79 ± 20.83 | | 2563.22 ± 12.63 | 2495.93 ± 12.63 | 2563.28 ± 12.63 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d8192 | 38.69 ± 0.01 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d16384 | 3802.17 ± 5.30 | | 4915.06 ± 6.76 | 4847.76 ± 6.76 | 4915.12 ± 6.76 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d16384 | 37.92 ± 0.02 | 39.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d32768 | 3379.46 ± 12.13 | | 10369.66 ± 36.88 | 10302.37 ± 36.88 | 10369.72 ± 36.88 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d32768 | 36.78 ± 0.04 | 37.67 ± 0.47 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d65535 | 2645.75 ± 2.06 | | 25611.32 ± 19.92 | 25544.03 ± 19.92 | 25611.39 ± 19.93 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d65535 | 34.59 ± 0.01 | 36.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d100000 | 2251.31 ± 42.68 | | 45411.66 ± 848.74 | 45344.37 ± 848.74 | 45411.73 ± 848.74 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d100000 | 32.57 ± 0.05 | 34.00 ± 0.00 | | | |

Concurrency test
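
For a quick sanity check of concurrent throughput without a benchmarking tool, a minimal sketch that fires C parallel requests at vLLM’s OpenAI-compatible endpoint can be used; the port, model name, and prompt below are assumptions:

C=4
time (
    for i in $(seq 1 "$C"); do
        # each request runs in the background so all C overlap
        curl -s http://localhost:8000/v1/completions \
            -H "Content-Type: application/json" \
            -d '{"model": "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8", "prompt": "Write a long essay about benchmarking.", "max_tokens": 128}' \
            > /dev/null &
    done
    wait    # total wall time approximates the batched completion latency
)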


FP8

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|------------:|----------:|---------:|---------------:|----------:|-------------:|--------------:|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c1) | 43.74 ± 0.00 | 43.74 ± 0.00 | 44.00 ± 0.00 | 44.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c2) | 3799.01 ± 5.44 | | | | 804.97 ± 273.21 | 0.00 ± 0.00 | 805.00 ± 273.20 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c2) | 60.92 ± 0.66 | 32.92 ± 2.17 | 72.00 ± 0.00 | 36.25 ± 0.43 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c4) | 3743.13 ± 46.92 | | | | 1612.41 ± 658.43 | 0.00 ± 0.00 | 1612.45 ± 658.42 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c4) | 77.59 ± 1.33 | 23.67 ± 2.50 | 108.00 ± 4.00 | 27.00 ± 1.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 (c8) | 3771.81 ± 30.35 | 11125.29 ± 18460.22 | | | 2774.71 ± 1248.02 | 643.22 ± 716.90 | 2774.73 ± 1248.00 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 (c8) | 95.50 ± 0.16 | 15.66 ± 2.09 | 159.50 ± 0.50 | 20.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c1) | 43.02 ± 0.02 | 43.02 ± 0.02 | 44.00 ± 0.00 | 44.00 ± 0.00 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c2) | 3808.13 ± 50.05 | 9409.89 ± 608.78 | | | 2412.16 ± 815.83 | 327.84 ± 329.21 | 2412.20 ± 815.79 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c2) | 47.73 ± 0.34 | 29.34 ± 5.06 | 70.00 ± 0.00 | 35.50 ± 0.50 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c4) | 3788.62 ± 22.88 | 3102.41 ± 1705.26 | | | 4268.68 ± 1830.10 | 1938.99 ± 1495.08 | 4268.71 ± 1830.07 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c4) | 52.05 ± 0.51 | 19.39 ± 4.72 | 108.00 ± 0.00 | 27.25 ± 0.43 | | | |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | pp2048 @ d4096 (c8) | 3721.93 ± 8.59 | 1706.27 ± 1533.06 | | | 7682.40 ± 3767.43 | 5229.76 ± 3584.63 | 7682.42 ± 3767.42 |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | tg128 @ d4096 (c8) | 54.17 ± 0.80 | 11.28 ± 3.45 | 144.00 ± 8.00 | 19.00 ± 1.80 | | | |

NVFP4

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|------------:|----------:|---------:|---------------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c1) | 39.53 ± 0.01 | 39.53 ± 0.01 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c2) | 3999.16 ± 33.26 | | | | 954.34 ± 108.56 | 0.00 ± 0.00 | 954.37 ± 108.55 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c2) | 73.08 ± 2.54 | 37.55 ± 1.04 | 78.00 ± 0.00 | 39.25 ± 0.43 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c4) | 4027.48 ± 1.36 | | | | 1639.69 ± 512.64 | 0.00 ± 0.00 | 1639.75 ± 512.61 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c4) | 92.36 ± 0.78 | 27.97 ± 2.60 | 123.50 ± 0.50 | 30.88 ± 0.33 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 (c8) | 4061.79 ± 13.11 | 5988.49 ± 9483.77 | | | 2801.77 ± 1085.85 | 867.94 ± 761.08 | 2801.81 ± 1085.83 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 (c8) | 117.15 ± 1.57 | 19.90 ± 2.89 | 196.00 ± 4.00 | 24.56 ± 0.50 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c1) | 39.00 ± 0.05 | 39.00 ± 0.05 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c2) | 4017.59 ± 23.06 | 6827.70 ± 134.02 | | | 2424.31 ± 634.37 | 450.11 ± 450.28 | 2424.33 ± 634.36 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c2) | 54.42 ± 0.03 | 32.59 ± 4.77 | 76.00 ± 0.00 | 38.00 ± 0.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c4) | 4016.49 ± 1.27 | 2424.85 ± 960.92 | | | 4251.89 ± 1640.57 | 2181.83 ± 1510.95 | 4251.94 ± 1640.55 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c4) | 59.06 ± 0.60 | 22.45 ± 5.45 | 120.00 ± 4.00 | 30.50 ± 1.00 | | | |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 @ d4096 (c8) | 4028.59 ± 18.27 | 1462.30 ± 1060.95 | | | 7396.25 ± 3437.88 | 5283.01 ± 3365.77 | 7396.31 ± 3437.87 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg128 @ d4096 (c8) | 64.17 ± 0.11 | 14.39 ± 4.99 | 192.00 ± 0.00 | 24.62 ± 1.11 | | | |

Effect of Speculative Decoding (SD) on token generation

| Quant | No SD | SD = 2 |
|:------|------:|-------:|
| AWQ 4bit | <33.2 tokens/s | <59.9 tokens/s |
| FP8 | <44.1 tokens/s | <62.1 tokens/s |
| NVFP4 | <39.5 tokens/s | <64.8 tokens/s |
  • This is a record of the average token generation rate from the vLLM logs when asking for the longest possible response on a specific topic in the WebUI. Since differences from speculative decoding were not detectable in llama-benchy, this was verified manually.

  • For reference, eugr’s latest vLLM build reaches <90 tokens/s with gpt-oss 20b and <60 tokens/s with gpt-oss 120b, without EAGLE3.

  • Note 1: For AWQ 4-bit and FP8, using SD causes intermittent model crashes. The exact timing and conditions are difficult to pinpoint at this stage, but crashes were observed in both models after more than a week of use.

  • Note 2: "num_speculative_tokens": 2 appears to be the optimal setting for now; increasing it tends to lower the acceptance rate, which in turn reduces throughput. (A launch sketch follows below.)
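
For reference, this is roughly how MTP speculative decoding is enabled in vLLM for this model. The JSON shape follows the Qwen3-Next model card recipe, but treat the exact method name as an assumption for your particular vLLM build:

# Sketch: serve with 2 speculative tokens per step (the setting above).
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 65536 \
    --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'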


Korean ability test

| Quant | Language mixing | Jajang-1 | Jajang-2 |
|:------|:----------------|:---------|:---------|
| AWQ 4bit | Exists | Fail | Pass |
| FP8 | Rarely | Pass | Pass |
| NVFP4 | None | Pass | Pass |
  • This is not an official test, but one of the qualitative model-evaluation methods used in Korean LLM communities. It may not be relevant to English- or Chinese-speaking users.

  • Smaller models and more heavily quantized models tend to exhibit more severe language mixing and degraded cultural comprehension, and this test is designed to observe those phenomena. I’ve seen user reviews mentioning that tool use behaves strangely in quantized models (especially MoE models), and I suspect this is a similar type of phenomenon.

  • Language mixing: When a question such as “Explain Maxwell’s equations in as much detail and length as possible” is asked in Korean, hiragana, katakana, Chinese characters, Arabic script, etc. may appear intermittently or frequently in the response. This also occurs in the latest Sonnet/Haiku, Gemini Flash, gpt-5.3-codex, and similar models, though with varying frequency. (A reproduction sketch follows after the lyric below.)

  • Jajang: This test assesses Korean language comprehension, specifically whether the model understands pragmatic expressions and implicit Korean cultural context beyond the surface-level content. Based on the song lyrics below, Jajang-1 asks the model to determine whether the narrator is male or female, and Jajang-2 asks why the mother said she doesn’t like jajangmyeon. The correct answer to Jajang-1 is “male”; the correct answer to Jajang-2 is that it was a white lie born of selfless sacrifice: the mother, despite difficult financial circumstances, ordered an expensive dish with her secret savings and gave it all to her child, lying so the child would not feel guilty or sorry. For reference, gpt-oss 20b passes Jajang-1 but fails Jajang-2; 120b passes both.

  • Original lyric: 어려서부터 우리 집은 가난했었고 남들 다하는 외식 몇 번 한 적이 없었고 일터에 나가신 어머니 집에 없으면 언제나 혼자서 끓여 먹었던 라면 그러다 라면이 너무 지겨워서 맛있는 것 좀 먹자고 대들었었어 그러자 어머님이 마지못해 꺼내신 숨겨두신 비상금으로 시켜주신 짜장면 하나에 너무나 행복했었어 하지만 어머님은 왠지 드시질 않았어 어머님은 짜장면이 싫다고 하셨어 어머님은 짜장면이 싫다고 하셨어 야이야~야 그렇게 살아가고 그렇게 후회하고 눈물도 흘리고 야이야~야 그렇게 살아가고 너무나 아프고 하지만 다시 웃고 (Translation: “Ever since I was young, our family was poor, and we never once ate out the way other families did. Whenever my mother was out working and not at home, I always boiled ramyeon for myself. Then one day I got so sick of ramyeon that I demanded we eat something tasty for once, so my mother reluctantly took out the emergency money she had hidden away and ordered me a jajangmyeon, and that single bowl made me so happy. But for some reason my mother didn’t eat. My mother said she didn’t like jajangmyeon. My mother said she didn’t like jajangmyeon. Ya-i-ya, living like that, regretting like that, shedding tears; ya-i-ya, living like that, hurting so much, but smiling again.”)
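
For anyone who wants to reproduce the language-mixing probe, here is a minimal sketch against the OpenAI-compatible endpoint; the port, model name, and token budget are assumptions:

# Ask for a long Korean answer, then scan the output for stray
# hiragana/katakana/Hanzi/Arabic script mixed into the Hangul.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4",
         "messages": [{"role": "user", "content": "맥스웰 방정식을 최대한 자세하고 길게 설명해 주세요"}],
         "max_tokens": 2048}'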


Appendix: Korean interaction results from other models

| Model (Full Name) | Description |
|:------------------|:------------|
| cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit | Generates incoherent/gibberish sentences. |
| cyankiwi/Magistral-Small-2507-AWQ-4bit | Generates incoherent/gibberish sentences. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Severe language mixing; crashes after a single response. |
| nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | Slight language mixing; notably, maintains around 40 tokens/s even in long contexts. |
| cyankiwi/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit & Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Slight language mixing. Both show similar stability, but FP8 is naturally slower than AWQ 4-bit, so AWQ 4-bit seems preferable. |
| cyankiwi/GLM-4.5-Air-AWQ-4bit | No language mixing, but too slow. |
| cyankiwi/GLM-4.7-Flash-AWQ-4bit (with eugr’s vllm) | No language mixing. Maintains around 41 tokens/s, but the thinking process is too long. |

Some tested models may be missing from the list, likely because they didn’t function properly (e.g., Baidu Ernie).


BTW, the FlashInfer implementation lets you fit more context (actually 2x more compared to the FLASH_ATTN one); that’s for the FP8 model.

docker run --rm --name dgx-vllm-nvfp4 \
    --network host --gpus all --ipc=host \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    -v /home/edison/Downloads/vllm/models:/models \
    -v $(pwd)/fix_flashinfer_e2m1_sm121.py:/tmp/fix1.py \
    -v $(pwd)/fix_flashinfer_nvfp4_moe_backend.py:/tmp/fix2.py \
    -v $(pwd)/fix_capability_121_v112.py:/tmp/fix3.py \
    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -e MODEL=/models/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    -e PORT=8888 \
    -e GPU_MEMORY_UTIL=0.70 \
    -e MAX_MODEL_LEN=65536 \
    -e MAX_NUM_SEQS=128 \
    -e VLLM_EXTRA_ARGS="--attention-backend flashinfer --kv-cache-dtype fp8" \
    --entrypoint bash \
    avarok/dgx-vllm-nvfp4-kernel:v22 \
    -c "python3 /tmp/fix1.py && python3 /tmp/fix2.py && python3 /tmp/fix3.py && \
        exec vllm serve \$MODEL --host 0.0.0.0 --port \$PORT \
        --max-model-len \$MAX_MODEL_LEN --gpu-memory-utilization \$GPU_MEMORY_UTIL \
        --max-num-seqs \$MAX_NUM_SEQS \$VLLM_EXTRA_ARGS"


| model                                     |   test |             t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:------------------------------------------|-------:|----------------:|-------------:|----------------:|----------------:|----------------:|
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  pp512 | 1587.78 ± 24.74 |              |   324.78 ± 4.95 |   322.54 ± 4.95 |   324.85 ± 4.96 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    28.15 ± 0.05 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  pp512 | 1561.12 ± 23.78 |              |   330.29 ± 5.06 |   328.05 ± 5.06 |   330.35 ± 5.06 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    28.14 ± 0.05 | 29.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 |  2056.84 ± 9.92 |              |   997.96 ± 4.79 |   995.73 ± 4.79 |   998.03 ± 4.80 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    28.00 ± 0.08 | 28.40 ± 0.49 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 2050.19 ± 17.61 |              |  1001.25 ± 8.64 |   999.01 ± 8.64 |  1001.31 ± 8.64 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    27.95 ± 0.06 | 28.80 ± 0.40 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp8192 | 1924.07 ± 10.77 |              | 4260.02 ± 23.78 | 4257.78 ± 23.78 | 4260.09 ± 23.76 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |   tg32 |    27.40 ± 0.05 | 28.00 ± 0.00 |                 |                 |                 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp8192 |  1928.50 ± 5.51 |              | 4250.14 ± 12.11 | 4247.90 ± 12.11 | 4250.22 ± 12.11 |
| /models/Qwen3-Next-80B-A3B-Instruct-NVFP4 |  tg128 |    27.40 ± 0.05 | 28.00 ± 0.00 |                 |                 |                 |

Why did I get only 28 t/s without MTP (and much slower with MTP)?

Very interesting results! I wonder what makes NVFP4 not mix languages. This has been frustrating for me.

Hi,

I ran the image with the GitHub reference commands (the only change was MAX_MODEL_LEN=128000).

docker pull avarok/dgx-vllm-nvfp4-kernel:v22

docker run -d --name vllm-nvfp4 \
    --network host --gpus all --ipc=host \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -e MODEL=nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    -e PORT=8888 -e GPU_MEMORY_UTIL=0.90 \
    -e MAX_MODEL_LEN=65536 -e MAX_NUM_SEQS=128 \
    -e VLLM_EXTRA_ARGS="--attention-backend flashinfer --kv-cache-dtype fp8" \
    avarok/dgx-vllm-nvfp4-kernel:v22 serve
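
One quick check before benchmarking: vLLM’s OpenAI-compatible server exposes a model listing, so a probe like this (port as in the command above) confirms the container is actually serving:

curl -s http://localhost:8888/v1/models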

What I’ve found is that you applied some fixes via the fix*.py files. How about trying with the original instructions?

Also, llama-benchy shows no difference between MTP on/off modes. As mentioned, I checked the differences via the vLLM logs.

BTW, you don’t have to use the avarok image for that; Marlin NVFP4 works on a standard docker build as well, e.g.:

./launch-cluster.sh --solo \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  --gpu-memory-utilization 0.7 \
  --host 0.0.0.0 --port 8888 \
  --max-model-len 128000 \
  --load-format fastsafetensors

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|:-----|----:|---------:|----------:|-------------:|--------------:|
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | pp2048 | 3692.34 ± 512.38 | | 577.82 ± 93.12 | 567.51 ± 93.12 | 577.90 ± 93.12 |
| nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 | tg32 | 43.28 ± 1.72 | 44.69 ± 1.78 | | | |

llama-benchy (0.3.1)
date: 2026-02-22 21:51:18 | latency mode: api


The NVFP4 version hasn’t been used long-term yet, so any issues may simply not have surfaced. FP8 also doesn’t frequently exhibit language mixing; as the word implies, it happens only “rarely” (IMO comparable to commercial-model levels). I expect language mixing could also show up with NVFP4 during extended use, and I’ll post an update if anything is found.

That said, I have commonly observed language mixing across FP4/AWQ 4-bit models, which suggests the custom vLLM build may have addressed whatever was causing the issue.

For reference, I also attempted to run Qwen3-Next NVFP4 with the NVCR vLLM version, but the absence of any records suggests it either failed to load or crashed after a single response. As I recall, TRT-LLM also failed, and it was through tbraun96’s vLLM that normal operation was first confirmed. In fact, none of the NVFP4 models worked with the playbook examples at all.

I suspect it was related to running the NVIDIA DGX Spark diagnostic program yesterday. I just unplugged the power adapter from the power strip, waited a few minutes, and then restarted the system for testing. It has basically returned to normal. (We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! - #91 by cho)