Make GLM-4.7-Flash go BRRRRR

Hey everyone,
glad I found this community here of fellow nerds. Just recently got the gb10 version from ASUS and mostly trying to run models for local development and playing around with inference and training. Hopefully learning lots of things along they way :)

My goal here is to replicate this awesome work by @christopher_owen on making GPT-OSS 120B go as fast as possible. Ill be posting all the improvements I can find and updating this thread. So far support for GLM-4.7-Flash has not been great on consumer blackwell in general. I think its a great model, but lots of room for improvement. I started playing around for a few nights now and its getting more usable especially for long context. We went up to 13 t/s on 200k context.

Ill leave the quick and dirty way of replicating here, will update with some more info (maybe on GH?) in the coming days. The fixes itself are not too complicated and should be able to be replicated in minutes.

I use the standard container from scitrera no magic on that side

docker run -d --privileged --gpus all --rm --ipc=host --network host \
  --name glm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  sleep infinity

This fixes the config to enable vllm to use an optimized backend for MLA

docker exec glm sed -i 's/"pangu_ultra_moe_mtp",/"pangu_ultra_moe_mtp",\n            "glm4_moe_lite",/' /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/model_arch_config_convertor.py

This fix is responsible for increasing long context tg, there is a hardcoded value for the number of kv splits and right now its 4, literally 4. So when you have long context like 64k it creates 4 x 32k splits which completely underutilises the SMs since they process sequentially within a chunk (to my knowledge and perf numbers seem to say the same). I tried setting a few different ones and landed on max(32, min( 128, max_seq_len / 1500)) expression. If its too high short context suffers a bit due to overhead.

docker exec glm sed -i 's/num_kv_splits = 1 if vllm_is_batch_invariant() else 4/# Dynamic splits: ~1.5K tokens per split, clamped to [32, 128]\n        max_seq_len = int(attn_metadata.decode.seq_lens.max().item())\n        num_kv_splits = 1 if vllm_is_batch_invariant() else max(32, min(128, max_seq_len \/\/ 1500))/' /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py

then just use docker exec it glm bash and run the command or use tmux first or whatever you like best. You can use both AWQ and NVFP4. AWQ is a bit faster and smaller I think.

vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
--gpu-memory-utilization 0.85 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 202752 \
--max-num-batched-tokens 4096 \
--max-num-seqs 64

The results below are actually from when I used flat out 64 as num_kv_splits, if you use the dynamic one then small context is even a bit faster, like 43~ ish

model test t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
glm-4.7-flash pp2048 6544.95 ± 62.05 360.34 ± 2.98 312.94 ± 2.98 360.40 ± 2.97
glm-4.7-flash tg32 40.98 ± 0.09
glm-4.7-flash ctx_pp @ d4096 6427.55 ± 5.34 684.66 ± 0.53 637.26 ± 0.53 684.72 ± 0.53
glm-4.7-flash ctx_tg @ d4096 39.12 ± 0.06
glm-4.7-flash pp2048 @ d4096 4181.95 ± 154.95 537.82 ± 18.62 490.41 ± 18.62 537.87 ± 18.62
glm-4.7-flash tg32 @ d4096 37.63 ± 0.09
glm-4.7-flash ctx_pp @ d8192 5277.83 ± 16.01 1599.57 ± 4.72 1552.17 ± 4.72 1599.62 ± 4.72
glm-4.7-flash ctx_tg @ d8192 36.15 ± 0.05
glm-4.7-flash pp2048 @ d8192 3194.30 ± 15.17 688.56 ± 3.05 641.16 ± 3.05 688.61 ± 3.04
glm-4.7-flash tg32 @ d8192 34.85 ± 0.05
glm-4.7-flash ctx_pp @ d16384 3813.59 ± 224.63 4359.22 ± 265.01 4311.82 ± 265.01 4359.27 ± 264.99
glm-4.7-flash ctx_tg @ d16384 33.00 ± 0.16
glm-4.7-flash pp2048 @ d16384 1908.54 ± 368.68 1168.95 ± 250.93 1121.55 ± 250.93 1168.99 ± 250.93
glm-4.7-flash tg32 @ d16384 33.27 ± 0.22
glm-4.7-flash ctx_pp @ d32768 2604.91 ± 45.24 12630.57 ± 221.22 12583.17 ± 221.22 12630.62 ± 221.21
glm-4.7-flash ctx_tg @ d32768 31.72 ± 0.20
glm-4.7-flash pp2048 @ d32768 1168.51 ± 147.47 1831.26 ± 247.16 1783.86 ± 247.16 1831.31 ± 247.15
glm-4.7-flash tg32 @ d32768 31.39 ± 0.16
glm-4.7-flash ctx_pp @ d65535 1559.22 ± 8.77 42079.41 ± 236.69 42032.01 ± 236.69 42079.45 ± 236.67
glm-4.7-flash ctx_tg @ d65535 25.26 ± 0.06
glm-4.7-flash pp2048 @ d65535 656.00 ± 54.39 3192.31 ± 277.00 3144.91 ± 277.00 3192.36 ± 276.99
glm-4.7-flash tg32 @ d65535 25.13 ± 0.03
glm-4.7-flash ctx_pp @ d100000 1081.93 ± 2.66 92475.75 ± 227.14 92428.35 ± 227.14 92475.80 ± 227.12
glm-4.7-flash ctx_tg @ d100000 20.80 ± 0.03
glm-4.7-flash pp2048 @ d100000 452.89 ± 22.06 4580.57 ± 228.66 4533.17 ± 228.66 4580.65 ± 228.66
glm-4.7-flash tg32 @ d100000 20.76 ± 0.03
glm-4.7-flash ctx_pp @ d125000 871.94 ± 1.01 143406.82 ± 165.43 143359.42 ± 165.43 143406.88 ± 165.43
glm-4.7-flash ctx_tg @ d125000 17.94 ± 0.03
glm-4.7-flash pp2048 @ d125000 369.72 ± 16.02 5597.45 ± 248.04 5550.05 ± 248.04 5597.51 ± 248.03
glm-4.7-flash tg32 @ d125000 17.86 ± 0.02
glm-4.7-flash ctx_pp @ d150000 745.40 ± 4.34 201289.13 ± 1177.88 201241.73 ± 1177.88 201289.19 ± 1177.87
glm-4.7-flash ctx_tg @ d150000 15.85 ± 0.02
glm-4.7-flash pp2048 @ d150000 310.65 ± 12.11 6650.27 ± 264.77 6602.87 ± 264.77 6650.33 ± 264.77
glm-4.7-flash tg32 @ d150000 15.70 ± 0.03
glm-4.7-flash ctx_pp @ d180000 633.11 ± 0.10 284357.49 ± 46.25 284310.08 ± 46.25 284357.55 ± 46.25
glm-4.7-flash ctx_tg @ d180000 14.78 ± 0.02
glm-4.7-flash pp2048 @ d180000 261.92 ± 7.81 7873.69 ± 238.38 7826.29 ± 238.38 7873.75 ± 238.39
glm-4.7-flash tg32 @ d180000 13.91 ± 0.02
glm-4.7-flash ctx_pp @ d195000 591.02 ± 1.88 329991.54 ± 1051.05 329944.14 ± 1051.05 329991.60 ± 1051.06
glm-4.7-flash ctx_tg @ d195000 13.63 ± 0.00
glm-4.7-flash pp2048 @ d195000 216.90 ± 0.54 9489.54 ± 23.37 9442.14 ± 23.37 9489.59 ± 23.38
glm-4.7-flash tg32 @ d195000 13.50 ± 0.01

welcome! I’ll be following along closely!

This is nice!

I’ve implemented this patch as a mod in GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks.

To run, pull from the repository first.

To use the mod, first build the container with Transformers 5 support (--pre-tf) flag, e.g.:

./build-and-copy.sh -t vllm-node-tf5 --use-wheels --pre-tf -c

Drop --use-wheels if you experience an error during build (see the annoucement in the Quick Start section).

Then, to run on a single node:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7

To run on cluster:

./launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2

NOTE: vLLM implementation is suboptimal even with the patch. The model performance is still significantly slower than it should be for the model with this number of active parameters. Running in the cluster increases prompt processing performance, but not token generation. You can expect ~40 t/s generation speed in both single node and cluster.

thats, nice. I’ll need to take a better look at what community projects are out there and how to best integrate current and upcoming fixes.

Your right, lots of headroom still. The main problem is proper kernel support for SM121, the most unfortunate one is the bug in cutlass that prevents using nvpf4, so instead of doing native fp4 computation it uses Marlin for example to covert to bf16 and do calculations. KV cache is also in bf16 so loading times get quite high for long context.

Ill be playing around over the weekend seeing how difficult it is to move those over to custom triton kernels. Ive never used triton before and only played around with cuda a little bit. Gonna be a fun exercise. If we had a fused triton kernel for fp8 MLA that should already speed things up a lot for long context. Short context would need custom MOE kernels that are faster than current int4->bf16 calculations we do in Marlin. But sounds more complex. Will see

Have you had a look at Christopher’s work on MXFP4 optimizations?
Also, there is this PR: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix Ā· Pull Request #31740 Ā· vllm-project/vllm Ā· GitHub

I’m curious about this value. I was asking Gemini to to explain your flags to me, and it bawked at max-num-batched-tokens telling me I should consider having it match max-model-len(!) because ā€œOn Blackwell, you want the prefill to be fast. Matching this to your max model length ensures that long documents are processed in as few chunks as possible.ā€. The vllm docs didn’t make things any clearer to me. I trust your numbers more than I just Geminis, but I’m curious if you have thoughts on why it’s suggesting something so wildly different here.

This wasn’t my recommendation, I just used the one from OP, but generally you don’t want to have it too high as well, definitely not up to max-model-len. It just consumes more memory and slows down shorter requests. Generally, unless your workflow is sending high number of requests/large requests, you want to keep batch size below 8192. It is model dependent, but for most models, 2048-4096 is the sweet spot.

I usually don’t even set it, just leave it default.

Got it, thank you for the explanation! :)

Hi,

I tried this and it finished

./build-and-copy.sh -t vllm-node-tf5 --use-wheels --pre-tf

but when I run this:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 30000 \
  --gpu-memory-utilization 0.7

I get this:

Auto-detecting interfaces...
Error: No active IB interfaces found.

What might it be? I only have one HP Spark variant.

Ah, thanks, I need to skip interface detection altogether if --solo switch is used. I’ll publish a fix shortly.

Fixed, could you pull the changes and try again, please? No need to rebuild the container itself, I only changed the launch script.

That was fast! Thank you!

Script runs and it’s pulling the model now. Will update this when it’s running.

Edit: success!

(APIServer pid=50) INFO:     Started server process [50]
(APIServer pid=50) INFO:     Waiting for application startup.
(APIServer pid=50) INFO:     Application startup complete.
(APIServer pid=50) INFO:     127.0.0.1:51966 - "GET /v1/models HTTP/1.1" 200 OK

@eugr sorry if this is a silly question, but how do I start this detached so it’s just available in the BG? I tried changing exec to start and adding -d for DAEMON_MODE after reading the launch_cluster.sh script, however when I run:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  -d start vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7

It’s still running in my terminal (I’m not detached from it). If I hit Ctrl+C then it shuts down. I was aiming for something equivalent to docker run -d.

I added echo "Daemon mode?: $DAEMON_MODE" where it prints the modes it’s running in, and that’s definitely set to true.

Thank you!

Oh, I found the issue… The ā€œvllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bitā€ causes it to fall into this code which forces ACTION=exec. I’ll just remove this for now, but is this a bug?

        *) 
            # If it's not a flag and not a known action, treat as exec command for backward compatibility
            # unless it's the default 'start' implied.
            # However, to support "omitted" = start, we need to be careful.
            # If the arg looks like a command, it's exec.
            ACTION="exec"
            COMMAND_TO_RUN="$@"
            break 
            ;;

Edit: Removing Action="exec" did start the container in the background, but it doesn’t seem to actually start up. Last line of the log is ā€œWaiting for mod application…
Mod applied, starting containerā€¦ā€ but nothing else ever happens.

If you use start, you can’t usevllm serve ..., otherwise it treats it as exec. To start vllm, you will need to run docker exec ... separately. Exec always runs in interactive mode (so you could kill the cluster by pressing ctrl-c).

So you need to start the container first:

./launch-cluster.sh -t vllm-node-20260129-tf5 --solo   --apply-mod mods/fix-glm-4.7-flash-AWQ   -d start

Then start vllm (you can ctrl-c and it will not stop vllm or container, but you will lose startup logs):

docker exec vllm_node  bash -i -c "vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit   --tool-call-parser glm47   --reasoning-parser glm45   --enable-auto-tool-choice   --served-model-name glm-4.7-flash   --max-model-len 202752   --max-num-seqs 64   --host 0.0.0.0 --port 8888   --gpu-memory-utilization 0.7"

To stop:

./launch-cluster.sh -t vllm-node-20260129-tf5 --solo  stop

I just run everything inside a tmux session. This way I can detach and disconnect and not lose my terminal sessions. For production use, I start my vllm containers via llama-swap.

Ah, got it - that worked, thanks :)

I added > /proc/1/fd/1 2> /proc/1/fd/2 to the end of the command going to docker exec, so now the output shows up in docker logs.

I’m still a bit of a tmux noob šŸ˜„ though I like to just have containers for everything and stop/start them (I have a list and buttons on my dashboard). If this model works well, I’ll probably just move what’s in your scripts inside the container so I can just docker run -d it to create it, then docker stop and docker start (via the dashboard) when I want to use it.

Thanks for all your work on this - I had lost many hours trying to get this model to work (with Gemini’s help) previously without much luck šŸ™ƒ

I am still working on this but using some time to relearn some forgotten pytorch skills / and learning triton to be able to write those missing kernels myself. But its the perfect motivation to invest some time there :)

I am running this model using ollama on my (single) ASUS GX10 and trying to configure the model file so that it runs with Claude Code. Is this an issue? I see everyone here using docker containers and serving the model up directly. I use ollama because I am experimenting with model performance first.

Currently my model file is:

FROM glm-4.7-flash:q4_K_M

PARAMETER num_ctx        32768
PARAMETER num_batch      2048
PARAMETER num_gpu        999
PARAMETER num_thread     16
PARAMETER use_mmap       false

PARAMETER temperature    0.2
PARAMETER top_p          0.90
PARAMETER top_k          10
PARAMETER repeat_penalty 1.05
PARAMETER min_p          0.01

RENDERER glm-4.7
PARSER glm-4.7

SYSTEM ā€œā€"
You are an expert software engineer operating as an autonomous coding agent.

CRITICAL BEHAVIOR RULES:

When given a task, execute it immediately and completely. Do not ask for confirmation.

Never ask ā€œwould you like me to proceed?ā€ — just proceed.

Never ask ā€œwhat would you like me to do?ā€ — the user already told you.

If a task covers multiple files, work through all of them without stopping to check in.

Only ask a question if the task is genuinely ambiguous and you cannot make a reasonable default choice.

Prefer action over clarification. When in doubt, do the most reasonable thing and explain what you did.
ā€œā€"