Hey everyone,
glad I found this community here of fellow nerds. Just recently got the gb10 version from ASUS and mostly trying to run models for local development and playing around with inference and training. Hopefully learning lots of things along they way :)
My goal here is to replicate this awesome work by @christopher_owen on making GPT-OSS 120B go as fast as possible. Ill be posting all the improvements I can find and updating this thread. So far support for GLM-4.7-Flash has not been great on consumer blackwell in general. I think its a great model, but lots of room for improvement. I started playing around for a few nights now and its getting more usable especially for long context. We went up to 13 t/s on 200k context.
Ill leave the quick and dirty way of replicating here, will update with some more info (maybe on GH?) in the coming days. The fixes itself are not too complicated and should be able to be replicated in minutes.
I use the standard container from scitrera no magic on that side
docker run -d --privileged --gpus all --rm --ipc=host --network host \
--name glm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
scitrera/dgx-spark-vllm:0.14.0-t5 \
sleep infinity
This fixes the config to enable vllm to use an optimized backend for MLA
docker exec glm sed -i 's/"pangu_ultra_moe_mtp",/"pangu_ultra_moe_mtp",\n "glm4_moe_lite",/' /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/model_arch_config_convertor.py
This fix is responsible for increasing long context tg, there is a hardcoded value for the number of kv splits and right now its 4, literally 4. So when you have long context like 64k it creates 4 x 32k splits which completely underutilises the SMs since they process sequentially within a chunk (to my knowledge and perf numbers seem to say the same). I tried setting a few different ones and landed on max(32, min( 128, max_seq_len / 1500)) expression. If its too high short context suffers a bit due to overhead.
docker exec glm sed -i 's/num_kv_splits = 1 if vllm_is_batch_invariant() else 4/# Dynamic splits: ~1.5K tokens per split, clamped to [32, 128]\n max_seq_len = int(attn_metadata.decode.seq_lens.max().item())\n num_kv_splits = 1 if vllm_is_batch_invariant() else max(32, min(128, max_seq_len \/\/ 1500))/' /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py
then just use docker exec it glm bash and run the command or use tmux first or whatever you like best. You can use both AWQ and NVFP4. AWQ is a bit faster and smaller I think.
vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
--gpu-memory-utilization 0.85 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 202752 \
--max-num-batched-tokens 4096 \
--max-num-seqs 64
The results below are actually from when I used flat out 64 as num_kv_splits, if you use the dynamic one then small context is even a bit faster, like 43~ ish
| model | test | t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|
| glm-4.7-flash | pp2048 | 6544.95 ± 62.05 | 360.34 ± 2.98 | 312.94 ± 2.98 | 360.40 ± 2.97 |
| glm-4.7-flash | tg32 | 40.98 ± 0.09 | |||
| glm-4.7-flash | ctx_pp @ d4096 | 6427.55 ± 5.34 | 684.66 ± 0.53 | 637.26 ± 0.53 | 684.72 ± 0.53 |
| glm-4.7-flash | ctx_tg @ d4096 | 39.12 ± 0.06 | |||
| glm-4.7-flash | pp2048 @ d4096 | 4181.95 ± 154.95 | 537.82 ± 18.62 | 490.41 ± 18.62 | 537.87 ± 18.62 |
| glm-4.7-flash | tg32 @ d4096 | 37.63 ± 0.09 | |||
| glm-4.7-flash | ctx_pp @ d8192 | 5277.83 ± 16.01 | 1599.57 ± 4.72 | 1552.17 ± 4.72 | 1599.62 ± 4.72 |
| glm-4.7-flash | ctx_tg @ d8192 | 36.15 ± 0.05 | |||
| glm-4.7-flash | pp2048 @ d8192 | 3194.30 ± 15.17 | 688.56 ± 3.05 | 641.16 ± 3.05 | 688.61 ± 3.04 |
| glm-4.7-flash | tg32 @ d8192 | 34.85 ± 0.05 | |||
| glm-4.7-flash | ctx_pp @ d16384 | 3813.59 ± 224.63 | 4359.22 ± 265.01 | 4311.82 ± 265.01 | 4359.27 ± 264.99 |
| glm-4.7-flash | ctx_tg @ d16384 | 33.00 ± 0.16 | |||
| glm-4.7-flash | pp2048 @ d16384 | 1908.54 ± 368.68 | 1168.95 ± 250.93 | 1121.55 ± 250.93 | 1168.99 ± 250.93 |
| glm-4.7-flash | tg32 @ d16384 | 33.27 ± 0.22 | |||
| glm-4.7-flash | ctx_pp @ d32768 | 2604.91 ± 45.24 | 12630.57 ± 221.22 | 12583.17 ± 221.22 | 12630.62 ± 221.21 |
| glm-4.7-flash | ctx_tg @ d32768 | 31.72 ± 0.20 | |||
| glm-4.7-flash | pp2048 @ d32768 | 1168.51 ± 147.47 | 1831.26 ± 247.16 | 1783.86 ± 247.16 | 1831.31 ± 247.15 |
| glm-4.7-flash | tg32 @ d32768 | 31.39 ± 0.16 | |||
| glm-4.7-flash | ctx_pp @ d65535 | 1559.22 ± 8.77 | 42079.41 ± 236.69 | 42032.01 ± 236.69 | 42079.45 ± 236.67 |
| glm-4.7-flash | ctx_tg @ d65535 | 25.26 ± 0.06 | |||
| glm-4.7-flash | pp2048 @ d65535 | 656.00 ± 54.39 | 3192.31 ± 277.00 | 3144.91 ± 277.00 | 3192.36 ± 276.99 |
| glm-4.7-flash | tg32 @ d65535 | 25.13 ± 0.03 | |||
| glm-4.7-flash | ctx_pp @ d100000 | 1081.93 ± 2.66 | 92475.75 ± 227.14 | 92428.35 ± 227.14 | 92475.80 ± 227.12 |
| glm-4.7-flash | ctx_tg @ d100000 | 20.80 ± 0.03 | |||
| glm-4.7-flash | pp2048 @ d100000 | 452.89 ± 22.06 | 4580.57 ± 228.66 | 4533.17 ± 228.66 | 4580.65 ± 228.66 |
| glm-4.7-flash | tg32 @ d100000 | 20.76 ± 0.03 | |||
| glm-4.7-flash | ctx_pp @ d125000 | 871.94 ± 1.01 | 143406.82 ± 165.43 | 143359.42 ± 165.43 | 143406.88 ± 165.43 |
| glm-4.7-flash | ctx_tg @ d125000 | 17.94 ± 0.03 | |||
| glm-4.7-flash | pp2048 @ d125000 | 369.72 ± 16.02 | 5597.45 ± 248.04 | 5550.05 ± 248.04 | 5597.51 ± 248.03 |
| glm-4.7-flash | tg32 @ d125000 | 17.86 ± 0.02 | |||
| glm-4.7-flash | ctx_pp @ d150000 | 745.40 ± 4.34 | 201289.13 ± 1177.88 | 201241.73 ± 1177.88 | 201289.19 ± 1177.87 |
| glm-4.7-flash | ctx_tg @ d150000 | 15.85 ± 0.02 | |||
| glm-4.7-flash | pp2048 @ d150000 | 310.65 ± 12.11 | 6650.27 ± 264.77 | 6602.87 ± 264.77 | 6650.33 ± 264.77 |
| glm-4.7-flash | tg32 @ d150000 | 15.70 ± 0.03 | |||
| glm-4.7-flash | ctx_pp @ d180000 | 633.11 ± 0.10 | 284357.49 ± 46.25 | 284310.08 ± 46.25 | 284357.55 ± 46.25 |
| glm-4.7-flash | ctx_tg @ d180000 | 14.78 ± 0.02 | |||
| glm-4.7-flash | pp2048 @ d180000 | 261.92 ± 7.81 | 7873.69 ± 238.38 | 7826.29 ± 238.38 | 7873.75 ± 238.39 |
| glm-4.7-flash | tg32 @ d180000 | 13.91 ± 0.02 | |||
| glm-4.7-flash | ctx_pp @ d195000 | 591.02 ± 1.88 | 329991.54 ± 1051.05 | 329944.14 ± 1051.05 | 329991.60 ± 1051.06 |
| glm-4.7-flash | ctx_tg @ d195000 | 13.63 ± 0.00 | |||
| glm-4.7-flash | pp2048 @ d195000 | 216.90 ± 0.54 | 9489.54 ± 23.37 | 9442.14 ± 23.37 | 9489.59 ± 23.38 |
| glm-4.7-flash | tg32 @ d195000 | 13.50 ± 0.01 |