GLM-5.2 on a 4× GB10 cluster: ~22 tok/s decode, 256K ctx

Got GLM-5.2 (GlmMoeDsa / DeepSeek-Sparse-Attention arch) serving on 4× GB10 over a 100G MikroTik CRS504 switch. It was quite the hassle to get DSA + MTP running, but then again Claude did 90% of the work : )

Model

cyankiwi/GLM-5.2-AWQ-INT4, with a data-free 15% routed-expert prune (256→218 experts/layer, dropping the experts with the highest e_score_correction_bias; pure byte-level safetensors surgery on the INT4 weights, no dequant). ~636B total / ~40B active. The prune frees ~12 GiB/node, which is what makes cudagraph + MTP + 256K KV all fit.
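
For anyone wondering what the selection step of a prune like this could look like, here is a minimal sketch. It is not the actual script (that hasn't been shared); the function name `experts_to_keep` and the plain-list input are made up for illustration. It only shows the "drop the 15% of routed experts with the highest e_score_correction_bias" rule for one layer, and does not touch the safetensors surgery or router re-indexing.

```python
def experts_to_keep(bias, prune_fraction=0.15):
    """Return the indices of routed experts that survive the prune.

    bias: per-expert e_score_correction_bias values for ONE layer
          (hypothetical input format: a plain list of floats).
    prune_fraction: share of experts to drop (assumed to round down).
    """
    n_drop = int(len(bias) * prune_fraction)  # 256 * 0.15 -> 38 dropped
    # Expert indices ordered from highest bias to lowest.
    by_bias_desc = sorted(range(len(bias)), key=lambda i: bias[i], reverse=True)
    dropped = set(by_bias_desc[:n_drop])
    # Keep the survivors in their original order.
    return [i for i in range(len(bias)) if i not in dropped]


if __name__ == "__main__":
    fake_bias = [i * 0.01 for i in range(256)]  # stand-in values, not real weights
    kept = experts_to_keep(fake_bias)
    print(len(kept), "experts kept per layer")
```

With 256 experts and a 15% fraction rounded down, this drops 38 and keeps 218, which matches the 256→218 figure above.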

Results (llama-benchy 0.3.7, pp=2048 / tg=256, c=1)

Depth   Decode (tg)   Prefill (pp)
0       20.2 t/s      535 t/s
8K      21.9 t/s      517 t/s
32K     21.2 t/s      476 t/s
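
To put those rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes prefill and decode run back to back with no overlap, and it only counts the 2048 new prompt tokens and 256 generated tokens from the benchmark settings, not the time to build the existing context at each depth. The helper name `request_seconds` is made up for illustration.

```python
def request_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Estimate wall-clock time for one request.

    Assumption: prefill and decode are strictly sequential, so the
    total is just tokens / rate for each phase, added together.
    """
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


# depth -> (decode t/s, prefill t/s), copied from the results table
results = {"0": (20.2, 535), "8K": (21.9, 517), "32K": (21.2, 476)}

if __name__ == "__main__":
    for depth, (decode_tps, prefill_tps) in results.items():
        total = request_seconds(2048, 256, prefill_tps, decode_tps)
        print(f"depth {depth}: ~{total:.1f} s for pp=2048 / tg=256")
```

At these speeds the 256-token decode dominates the total, so decode-side gains such as MTP matter more for this workload than a further prefill speedup would.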

MTP gave the biggest uplift; CUDA graphs made only about a 3% difference. With optimizations and b12x, I got prefill from ~250 t/s to ~500 t/s, but couldn’t get it any higher.

I’ve only tested a few prompts with the 15% prune in Pi and have no idea about real-world SWE performance yet, so this is more of a proof of concept.

Awesome work, exciting to see GLM-5.2 running on Sparks. Have you thought about using an NVFP4 quant instead? For example, madeby561/GLM-5.2-NVFP4-REAP-504B-term on Hugging Face.

Apparently that’s a decent REAP for coding, according to people in the RTX 6000 Discord. You should theoretically see better prefill with the hardware acceleration.

Could you also share your recipe?

I tried another REAP NVFP4. It ran slightly slower than the AWQ (~10 tok/s vs ~13 tok/s without MTP), and got itself into a loop over “What’s the capital of France?” So I didn’t continue down the REAP path. I might try this one if I need longer ctx or if the 15% data-free prune I did turns out to have lobotomized the model. I would love to see a gentler, 15-20% REAP though.

I’ll share the recipe and the patches after I confirm everything is working properly. I had to write a bunch of patches, quantize the MTP drafter to get it to work with the INT4 AWQ, and prune the model, so I’m not exactly sure what the best way to share everything is.

What is the token gen speed at 100K or 200K? Is the speculative model slowing it down over 100K? I can’t wait to try it. Thank you for your work on this.

I did try a 64K depth bench before I finished all the optimizations, and IIRC the tg drop was not very significant. I think it went from ~21 to ~19 t/s, which could just be run-to-run noise.

I’m not an expert on LLMs but judging from some of Claude’s CoT the model behaves a lot like DSV4 Flash when it comes to performance in longer context, probably because it also uses DSA.

cool. thank you