I am surprised that I only get a maximum of roughly 12 t/s with the mxfp4 version via the llama.cpp WebUI (via llama-bench I get 59 t/s, tg128 @ d4096). With the Q8-XL version I get 54 t/s both via llama-bench and via the WebUI (and continue.dev).
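For reference, this is roughly the llama-bench invocation I would expect to reproduce that tg128 @ d4096 number (the model path is a placeholder, and -p 0 just skips the prompt-processing test):

/home/cjg/Projekte/01_llama.cpp/llama.cpp/build/bin/llama-bench -m /path/to/gpt-oss-120b-mxfp4.gguf -ngl 999 -fa 1 -p 0 -n 128 -d 4096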
My command line (I use the long-form arguments because they are easier for me to understand even after some time has passed):
/home/cjg/Projekte/01_llama.cpp/llama.cpp/build/bin/llama-server \
    -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL \
    --alias "gpt-oss-120b|Q8-XL" \
    --jinja \
    --gpu-layers 999 \
    --ctx-size 128000 \
    --host 0.0.0.0 \
    --port 51011 \
    --flash-attn 1 \
    --batch-size 2048 \
    --ubatch-size 2048 \
    --no-mmap \
    --log-file /home/cjg/.cache/llama.cpp/log/llama-server.log \
    --log-timestamps \
    --log-verbosity 3
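To narrow down whether the WebUI itself is the bottleneck, one thing that might help is timing a raw request against the server's OpenAI-compatible endpoint (host/port taken from my command above; the prompt and max_tokens are arbitrary):

time curl -s http://localhost:51011/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-oss-120b|Q8-XL","messages":[{"role":"user","content":"Write about 300 words on llamas."}],"max_tokens":300}'

Dividing completion_tokens from the usage field of the response by the wall-clock time gives a rough tokens/s figure without the WebUI in the loop; if that matches llama-bench, the slowdown is on the client side.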
Is this also the case for you, or is something wrong with my command line?