GLM-5.2 on a 4× GB10 cluster: ~22 tok/s decode, 256K ctx

CosmicRaisins · June 22, 2026, 1:01pm

Got GLM-5.2 (GlmMoeDsa / DeepSeek-Sparse-Attention arch) serving on 4× GB10 over a 100G MikroTik CRS504 switch. It was quite the hassle to get DSA + MTP running, but then again Claude did 90% of the work : )

Model

cyankiwi/GLM-5.2-AWQ-INT4 with a data-free 15% routed-expert prune: 256→218 experts/layer, dropping the experts with the highest e_score_correction_bias. The prune is pure byte-level safetensors surgery on the INT4 weights, no dequant. ~636B total / ~40B active. It frees ~12 GiB/node, which is what makes cudagraph + MTP + 256K KV all fit.
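A minimal sketch of the selection step only, not the actual patch. It assumes the router bias for one layer is available as a plain list of floats and that "dropping highest e_score_correction_bias" means removing the experts whose bias value is largest. The function names and the toy bias values are hypothetical, and the byte-level safetensors rewrite is not shown.

```python
def select_experts_to_keep(bias, n_keep):
    """Return sorted indices of the experts that survive the prune.

    Assumption: experts with the HIGHEST bias are the ones dropped,
    so the n_keep lowest-bias experts are kept.
    """
    if not 0 < n_keep <= len(bias):
        raise ValueError("n_keep must be between 1 and len(bias)")
    # Rank expert indices by bias ascending; ties broken by index so the
    # result is deterministic.
    order = sorted(range(len(bias)), key=lambda i: (bias[i], i))
    return sorted(order[:n_keep])


def pruned_fraction(n_total, n_keep):
    """Fraction of routed experts removed per layer."""
    return (n_total - n_keep) / n_total


# Toy example with 6 experts (made-up bias values), keeping 4.
toy_bias = [0.10, 0.90, -0.20, 0.40, 0.75, 0.05]
kept = select_experts_to_keep(toy_bias, 4)
print(kept)

# The 256 -> 218 prune from the post, as a fraction.
print(round(pruned_fraction(256, 218), 4))
```

For 256→218 this works out to 38 experts removed per layer, or about 14.8%, which matches the quoted "15%". In a real prune the kept indices would also have to be applied consistently to the router weights and the bias tensor itself, not just to the expert weights.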

Results (llama-benchy 0.3.7, pp=2048 / tg=256, c=1)

Depth	Decode (tg)	Prefill (pp)
0	20.2 t/s	535 t/s
8K	21.9 t/s	517 t/s
32K	21.2 t/s	476 t/s
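To put the table in wall-clock terms, here is a rough sketch that converts the reported rates into seconds for the benchmark shape (pp=2048, tg=256). It assumes throughput is constant over the run and that the prefill rate at a given depth applies only to the 2048 new prompt tokens. The dictionary layout and function name are illustrative, not part of llama-benchy.

```python
# depth label -> (decode tokens/s, prefill tokens/s), copied from the table above
RESULTS = {
    "0": (20.2, 535.0),
    "8K": (21.9, 517.0),
    "32K": (21.2, 476.0),
}


def estimate_seconds(depth, prompt_tokens=2048, gen_tokens=256):
    """Return (prefill_seconds, decode_seconds) for one request.

    Simplification: tokens divided by the reported average rate,
    ignoring scheduling overhead and warm-up.
    """
    decode_tps, prefill_tps = RESULTS[depth]
    return prompt_tokens / prefill_tps, gen_tokens / decode_tps


for depth in RESULTS:
    prefill_s, decode_s = estimate_seconds(depth)
    print(f"depth {depth}: prefill {prefill_s:.1f}s, decode {decode_s:.1f}s")
```

At depth 0 this comes to roughly 3.8 s of prefill and 12.7 s of decode for a single stream, so with these request sizes decode dominates the total time.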

MTP gave the biggest uplift; CUDA Graph only made a difference of 3% or so. With optimizations and b12x, I got prefill from 250 to ~500 t/s, but couldn’t get it any higher.

I’ve only tested a few prompts with the 15% prune in Pi and have no idea about real-world SWE performance yet, so this is more of a proof of concept.

p33zy · June 22, 2026, 1:33pm

Awesome work, exciting to see GLM-5.2 running on Sparks. Have you thought about using an NVFP4 quant instead? There’s madeby561/GLM-5.2-NVFP4-REAP-504B-term on Hugging Face.

Apparently that’s a decent REAP for coding, according to people in the RTX 6000 Discord. You should theoretically see better prefill with the hardware acceleration.

Could you also share your recipe?

CosmicRaisins · June 22, 2026, 1:56pm

I tried another REAP NVFP4. It ran slightly slower than the AWQ (~10 tok/s vs ~13 tok/s without MTP), and got itself into a loop over “What’s the capital of France?” So I didn’t continue down the REAP path. I might try this one if I need longer ctx or if the 15% data-free prune I did lobotomized the model. I would love to see a gentler, 15-20% REAP though.

I’ll share the recipe and the patches after I confirm everything is working properly. I had to write a bunch of patches, quantize the MTP drafter to get it to work with the INT4 AWQ, and prune the model, so I’m not exactly sure what the best way to share everything is.
