Llama.cpp GLM 4.7 Flash Benchmark

aceangel · January 26, 2026, 8:34pm

I’ve been experimenting with an early branch of llama.cpp running GLM 4.7 Flash with the FA (flash attention) fix from ggml-org/llama.cpp Pull Request #18953, “CUDA: add gqa_ratio 4 for GLM 4.7 flash” by am17an, and token generation is noticeably faster. Prompt processing is still not as quick as vLLM, though, and I’m not sure whether I’m missing something there. It also looks like the main llama.cpp branch doesn’t include some of these optimizations yet, even though the fix has already been merged.

./build/bin/llama-server \
  --hf-repo unsloth/GLM-4.7-Flash-GGUF \
  --host 0.0.0.0 \
  --port 41447 \
  -c 524288 \
  -ngl 999 \
  -t 20 \
  -np 4 \
  --cont-batching \
  -ub 2048 \
  -fa 1 \
  --jinja \
  --reasoning-format auto \
  --no-mmap

uvx llama-benchy \
  --base-url http://localhost:41447/v1 \
  --model unsloth/glm-4.7-flash-GGUF \
  --pp 2048 \
  --tg 32 \
  --depth 0 4096 8192 16384 32768 65535 100000 \
  --runs 1 \
  --enable-prefix-caching \
  --latency-mode generation

model	test	t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
unsloth/glm-4.7-flash-GGUF	pp2048	2026.82 ± 0.00	983.58 ± 0.00	955.19 ± 0.00	983.62 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32	52.46 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d4096	1768.13 ± 0.00	2009.01 ± 0.00	1980.62 ± 0.00	2009.06 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d4096	55.43 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d4096	1502.19 ± 0.00	1391.74 ± 0.00	1363.35 ± 0.00	1391.77 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d4096	47.40 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d8192	1354.72 ± 0.00	5271.56 ± 0.00	5243.17 ± 0.00	5271.62 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d8192	44.99 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d8192	987.55 ± 0.00	2102.21 ± 0.00	2073.82 ± 0.00	2102.25 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d8192	47.63 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d16384	925.85 ± 0.00	15896.01 ± 0.00	15867.62 ± 0.00	15896.04 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d16384	41.33 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d16384	600.23 ± 0.00	3440.40 ± 0.00	3412.01 ± 0.00	3440.43 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d16384	40.08 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d32768	569.48 ± 0.00	51472.06 ± 0.00	51443.67 ± 0.00	51472.08 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d32768	17.39 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d32768	336.61 ± 0.00	6112.50 ± 0.00	6084.11 ± 0.00	6112.53 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d32768	16.63 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d65535	317.69 ± 0.00	182461.28 ± 0.00	182432.89 ± 0.00	182461.32 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d65535	11.02 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d65535	184.63 ± 0.00	11121.08 ± 0.00	11092.69 ± 0.00	11121.11 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d65535	10.33 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_pp @ d100000	218.32 ± 0.00	405616.25 ± 0.00	405587.86 ± 0.00	405616.27 ± 0.00
unsloth/glm-4.7-flash-GGUF	ctx_tg @ d100000	7.68 ± 0.00
unsloth/glm-4.7-flash-GGUF	pp2048 @ d100000	122.08 ± 0.00	16803.78 ± 0.00	16775.39 ± 0.00	16803.81 ± 0.00
unsloth/glm-4.7-flash-GGUF	tg32 @ d100000	7.56 ± 0.00
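
The table shows throughput falling as depth grows, and the drop is not even: tg32 goes from 40.08 t/s at d16384 to 16.63 t/s at d32768. A minimal Python sketch to make that visible, assuming nothing beyond the pp2048 and tg32 rows copied from the table above. The name `step_ratios` is made up for illustration and is not part of llama-benchy.

```python
# (depth, pp2048 t/s, tg32 t/s), copied from the unsloth table above.
ROWS = [
    (0, 2026.82, 52.46),
    (4096, 1502.19, 47.40),
    (8192, 987.55, 47.63),
    (16384, 600.23, 40.08),
    (32768, 336.61, 16.63),
    (65535, 184.63, 10.33),
    (100000, 122.08, 7.56),
]


def step_ratios(rows):
    """Return (depth, pp_ratio, tg_ratio) for each depth after the first.

    Each ratio is the throughput at that depth divided by the throughput
    at the previous depth, so values well below 1.0 mark a sharp drop.
    """
    out = []
    for (_, pp_prev, tg_prev), (depth, pp, tg) in zip(rows, rows[1:]):
        out.append((depth, pp / pp_prev, tg / tg_prev))
    return out


if __name__ == "__main__":
    for depth, pp_ratio, tg_ratio in step_ratios(ROWS):
        print(f"d{depth:<7} pp2048 x{pp_ratio:.2f}   tg32 x{tg_ratio:.2f}")
```

With a single run per test these ratios are only indicative, but they show where a larger step-to-step change sits.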

./build/bin/llama-server \
  --hf-repo noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF \
  --host 0.0.0.0 \
  --port 41447 \
  -c 524288 \
  -ngl 999 \
  -t 20 \
  -np 4 \
  --cont-batching \
  -ub 2048 \
  -fa 1 \
  --jinja \
  --reasoning-format auto \
  --no-mmap

uvx llama-benchy \
  --base-url http://localhost:41447/v1 \
  --model noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF \
  --pp 2048 \
  --tg 32 \
  --depth 0 4096 8192 16384 32768 65535 100000 \
  --runs 1 \
  --enable-prefix-caching \
  --latency-mode generation

model	test	t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048	2254.97 ± 0.00	868.29 ± 0.00	841.70 ± 0.00	868.34 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32	52.44 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d4096	1909.84 ± 0.00	1953.46 ± 0.00	1926.86 ± 0.00	1953.49 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d4096	45.53 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d4096	1471.85 ± 0.00	1418.04 ± 0.00	1391.45 ± 0.00	1418.08 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d4096	43.53 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d8192	1431.18 ± 0.00	5168.51 ± 0.00	5141.91 ± 0.00	5168.53 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d8192	44.16 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d8192	1022.83 ± 0.00	2028.88 ± 0.00	2002.28 ± 0.00	2028.92 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d8192	39.51 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d16384	986.46 ± 0.00	14576.67 ± 0.00	14550.08 ± 0.00	14576.71 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d16384	35.30 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d16384	677.58 ± 0.00	3049.12 ± 0.00	3022.53 ± 0.00	3049.16 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d16384	36.21 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d32768	592.33 ± 0.00	48857.25 ± 0.00	48830.66 ± 0.00	48857.29 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d32768	15.79 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d32768	338.26 ± 0.00	6081.19 ± 0.00	6054.59 ± 0.00	6081.22 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d32768	17.95 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d65535	323.80 ± 0.00	179278.38 ± 0.00	179251.79 ± 0.00	179278.42 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d65535	10.10 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d65535	184.81 ± 0.00	11108.39 ± 0.00	11081.79 ± 0.00	11108.42 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d65535	10.88 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_pp @ d100000	220.53 ± 0.00	402260.75 ± 0.00	402234.15 ± 0.00	402260.79 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	ctx_tg @ d100000	7.00 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	pp2048 @ d100000	112.32 ± 0.00	18260.09 ± 0.00	18233.50 ± 0.00	18260.15 ± 0.00
noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF	tg32 @ d100000	6.94 ± 0.00
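
To put the two quants side by side, here is a small Python sketch that computes the percentage difference of the MXFP4 run relative to the unsloth run at each depth. The numbers are the pp2048 and tg32 t/s values copied from the two tables above. The helper names `pct_diff` and `compare` are hypothetical, chosen only for this example. Since both benchmarks used `--runs 1`, each figure is a single sample, so small differences should be read with caution.

```python
# depth -> (pp2048 t/s, tg32 t/s), copied from the two tables above.
UNSLOTH = {
    0: (2026.82, 52.46),
    4096: (1502.19, 47.40),
    8192: (987.55, 47.63),
    16384: (600.23, 40.08),
    32768: (336.61, 16.63),
    65535: (184.63, 10.33),
    100000: (122.08, 7.56),
}
MXFP4 = {
    0: (2254.97, 52.44),
    4096: (1471.85, 43.53),
    8192: (1022.83, 39.51),
    16384: (677.58, 36.21),
    32768: (338.26, 17.95),
    65535: (184.81, 10.88),
    100000: (112.32, 6.94),
}


def pct_diff(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def compare(base, other):
    """Return (depth, pp % diff, tg % diff) of `other` vs `base` per depth."""
    rows = []
    for depth in sorted(base):
        pp_base, tg_base = base[depth]
        pp_other, tg_other = other[depth]
        rows.append(
            (depth, pct_diff(pp_other, pp_base), pct_diff(tg_other, tg_base))
        )
    return rows


if __name__ == "__main__":
    for depth, pp, tg in compare(UNSLOTH, MXFP4):
        print(f"d{depth:<7} pp2048 {pp:+6.1f}%   tg32 {tg:+6.1f}%")
```

Positive values mean the MXFP4 run was faster at that depth, negative values mean it was slower.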
