GLM-5.2 on a 4× GB10 cluster: ~22 tok/s decode, 256K ctx

Got GLM-5.2 (GlmMoeDsa / DeepSeek-Sparse-Attention arch) serving on 4× GB10 over a 100G MikroTik CRS504 switch. It was quite the hassle to get DSA + MTP running, but then again Claude did 90% of the work : )

Model

cyankiwi/GLM-5.2-AWQ-INT4, with a data-free 15% routed-expert prune (256→218 experts/layer, dropping the experts with the highest e_score_correction_bias; pure byte-level safetensors surgery on the INT4 weights, no dequant). ~636B total / ~40B active. The prune frees ~12 GiB/node, which is what lets CUDA graphs + MTP + 256K KV cache all fit.
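For anyone curious what the selection step of a prune like this might look like, here is a minimal sketch. It only covers choosing which experts to keep from one layer's e_score_correction_bias vector; the function name and the random stand-in bias values are hypothetical, and the actual byte-level safetensors rewrite (plus slicing the router weights to match) is not shown.

```python
import random


def experts_to_keep(bias, n_keep):
    """Return sorted indices of the n_keep experts with the LOWEST bias.

    Hypothetical helper: `bias` stands in for one layer's
    e_score_correction_bias vector, one value per routed expert.
    """
    if not 0 < n_keep <= len(bias):
        raise ValueError("n_keep must be between 1 and the number of experts")
    # Sort expert indices by bias ascending; ties are broken by index
    # so the result is deterministic.
    order = sorted(range(len(bias)), key=lambda i: (bias[i], i))
    # Keep the lowest-bias experts, restored to their original order.
    return sorted(order[:n_keep])


# Stand-in data: random values, NOT the real model's biases.
random.seed(0)
bias = [random.gauss(0.0, 1.0) for _ in range(256)]

keep = experts_to_keep(bias, 218)
dropped = sorted(set(range(256)) - set(keep))
pruned_fraction = len(dropped) / 256

print(len(keep), len(dropped), round(pruned_fraction * 100, 1))  # 218 38 14.8
```

Dropping 38 of 256 experts works out to 14.8%, which is where the "15%" figure comes from.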

Results (llama-benchy 0.3.7, pp=2048 / tg=256, c=1)

Depth   Decode (tg)   Prefill (pp)
0       20.2 t/s      535 t/s
8K      21.9 t/s      517 t/s
32K     21.2 t/s      476 t/s
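To put those throughput numbers in wall-clock terms, here is a small sketch that converts them into seconds per benchmark request. It assumes the reported rates apply uniformly to the 2048 prompt tokens and 256 generated tokens of each run; the variable and function names are made up for illustration.

```python
# Depth -> (decode t/s, prefill t/s), copied from the results above.
results = {"0": (20.2, 535.0), "8K": (21.9, 517.0), "32K": (21.2, 476.0)}

PP_TOKENS = 2048  # prompt tokens per run (pp=2048)
TG_TOKENS = 256   # generated tokens per run (tg=256)


def request_seconds(decode_tps, prefill_tps, pp=PP_TOKENS, tg=TG_TOKENS):
    """Return (prefill_seconds, decode_seconds) for one request."""
    return pp / prefill_tps, tg / decode_tps


timings = {depth: request_seconds(*rates) for depth, rates in results.items()}

for depth, (prefill_s, decode_s) in timings.items():
    print(f"depth {depth}: prefill {prefill_s:.1f} s, decode {decode_s:.1f} s")
```

At every depth this comes out to roughly 4 s of prefill and about 12 s of decode, so decode dominates at this prompt size.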

MTP gave the biggest uplift; CUDA graphs made only a ~3% difference. With optimizations and b12x, I got prefill from 250 to ~500 t/s, but couldn’t get it any higher.
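As a rough intuition for why MTP helps decode so much, here is a sketch of the usual speculative-decoding estimate of tokens emitted per target-model forward pass. It assumes each drafted token is accepted independently with the same probability and ignores the drafter's own cost; the acceptance rate and draft length below are hypothetical placeholders, not measurements from this setup.

```python
def expected_tokens_per_step(accept_rate, num_draft):
    """Expected tokens per verification step under i.i.d. acceptance.

    With acceptance probability a and k drafted tokens this is
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if accept_rate == 1.0:
        # Every draft accepted: k drafted tokens plus the bonus token.
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


# Hypothetical values for illustration only.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(rate, num_draft=1)
    print(f"accept_rate={rate}: {tokens:.2f} tokens per step")
```

The real speedup is lower than this ratio because the draft head and the verification of extra tokens are not free.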

I’ve only tested a few prompts with the 15% prune in Pi and have no idea about real-world SWE performance yet, so this is more of a proof of concept.

Awesome work, exciting to see GLM-5.2 running on Sparks. Have you thought about using an NVFP4 quant instead, e.g. madeby561/GLM-5.2-NVFP4-REAP-504B-term on Hugging Face?

Apparently that’s a decent REAP for coding, according to people in the RTX 6000 Discord. You should theoretically see better prefill with the hardware acceleration.

Could you also share your recipe?

I tried another REAP NVFP4. It ran slightly slower than the AWQ (~10 tok/s vs ~13 tok/s without MTP), and got itself into a loop over “What’s the capital of France?” So I didn’t continue down the REAP path. I might try this one if I need longer ctx or if the 15% data-free prune I did lobotomized the model. I would love to see a gentler, 15-20% REAP though.

I’ll share the recipe and the patches after I confirm everything is working properly. I had to write a bunch of patches, quantize the MTP drafter to get it to work with the INT4 AWQ, and prune the model, so I’m not exactly sure what the best way to share everything is.