BTW, the FlashInfer implementation lets you fit more context (actually 2x more than with FLASH_ATTN), at least for the FP8 model.
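For anyone who wants to try it, switching backends in vLLM is just an environment variable. A minimal sketch (the model name and context length are placeholders, and whether FP8 KV cache frees additional memory depends on your setup):

```python
import os

# Select the FlashInfer attention backend; vLLM reads this env var
# at startup, so set it before importing vllm.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(
    model="your-org/your-fp8-model",  # placeholder: any FP8-quantized checkpoint
    max_model_len=65536,              # placeholder: raise this to use the freed memory
    kv_cache_dtype="fp8",             # optional: quantize the KV cache as well
)

print(llm.generate("Hello, Spark!"))
```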