As a personal research, I ran various cyankiwi’s AWQ 4-bit models, NVIDIA’s NVFP4 models and Qwen’s own official FP8 models. All results are based on a single-node setup. I’m not sure if this forum needs such a rudimentary report, and I feel a bit embarrassed to share it, but I thought I’d share it…

BTW, Flashinfer implementation allows to fit more context (actually 2x more into the context compared to FLASH_ATTN one), that’s for FP8 model.

docker run --rm --name dgx-vllm-nvfp4 \ --network host --gpus all --ipc=host \ -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \ -v /home/edison/Downloads/vllm/models:/models \ -v $(pwd)/fix_flashinfer_e2m1_sm121.py:/tmp/fix1.py \ -v $(pwd)/fix_flashinfer_nvfp4_moe_backe…

[image] SJ_gx10: Korean ability test Quant Language mixing Jajang-1 Jajang-2 AWQ 4bit Exist Fail Pass FP8 Rarely Pass Pass NVFP4 None Pass Pass Very interesting results! I wonder what makes NVFP4 not mix languages. This has been frustrating for me.

Hi, I ran the image with the github reference commands. (the only change was MAX_MODEL_LEN=128000) docker pull avarok/dgx-vllm-nvfp4-kernel:v22 docker run -d --name vllm-nvfp4 –network host --gpus all --ipc=host -v $HOME/.cache/huggingface:/root/.cache/huggingface -e VLLM_USE_FLASHINFER_MOE_F…

BTW, you don’t have to use avarok image for that, Marlin NVFP4 works on a standard docker build as well, e.g.: ./launch-cluster.sh --solo \ -e VLLM_NVFP4_GEMM_BACKEND=marlin \ -e VLLM_TEST_FORCE_FP8_MARLIN=1 \ -e VLLM_MARLIN_USE_ATOMIC_ADD=1\ exec vllm serve nvidia/Qwen3-Next-80B-A3B-Inst…

The NVFP4 version hasn’t been used long-term yet, so any issues may simply not have surfaced. FP8 also doesn’t frequently exhibit language mixing — as the word implies, it happens only “rarely” (IMO it is comparable to commercial model levels). I expect language mixing could also be observed with NV…

Qwen3-Next AWQ 4bit vs FP8 vs NVFP4 on single spark

Accelerated Computing DGX Spark / GB10 User Forum DGX Spark / GB10

cho February 23, 2026, 6:29am 8

I suspect it was related to running the NVIDIA DGX SPARK diagnostic program yesterday. I just unplugged the power adapter from the power strip, waited for a few minutes, and then restarted the system for testing. It has basically returned to normal. (We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! - #91 by cho)

Topic		Replies	Views
PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM DGX Spark / GB10	231	10094	April 21, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	145	6896	March 28, 2026
What's the best speed we can get with Qwen 3.6 27B without quantizing? DGX Spark / GB10	26	4554	April 24, 2026
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	74	4860	April 11, 2026
Best Q4 / NVFP4 model for quality Qwen3.5-27B or alternatives? DGX Spark / GB10 llama , deepseek , nemotron	16	1393	April 26, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	9330	April 9, 2026
From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f DGX Spark / GB10	10	1571	January 7, 2026
New bleeding-edge vLLM Docker Image: avarok/vllm-nvfp4-gb10-sm120 DGX Spark / GB10 Projects	35	2881	December 31, 2025
Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU goes 100% forever DGX Spark / GB10	36	1292	February 13, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	15159	March 24, 2026

Qwen3-Next AWQ 4bit vs FP8 vs NVFP4 on single spark

Related topics