Make GLM-4.7-Flash go BRRRRR

mjpansa · January 29, 2026, 10:32pm

Hey everyone,
glad I found this community here of fellow nerds. Just recently got the gb10 version from ASUS and mostly trying to run models for local development and playing around with inference and training. Hopefully learning lots of things along they way :)

My goal here is to replicate this awesome work by @christopher_owen on making GPT-OSS 120B go as fast as possible. Ill be posting all the improvements I can find and updating this thread. So far support for GLM-4.7-Flash has not been great on consumer blackwell in general. I think its a great model, but lots of room for improvement. I started playing around for a few nights now and its getting more usable especially for long context. We went up to 13 t/s on 200k context.

Ill leave the quick and dirty way of replicating here, will update with some more info (maybe on GH?) in the coming days. The fixes itself are not too complicated and should be able to be replicated in minutes.

I use the standard container from scitrera no magic on that side

docker run -d --privileged --gpus all --rm --ipc=host --network host \
  --name glm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  sleep infinity

This fixes the config to enable vllm to use an optimized backend for MLA

docker exec glm sed -i 's/"pangu_ultra_moe_mtp",/"pangu_ultra_moe_mtp",\n            "glm4_moe_lite",/' /usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/model_arch_config_convertor.py

This fix is responsible for increasing long context tg, there is a hardcoded value for the number of kv splits and right now its 4, literally 4. So when you have long context like 64k it creates 4 x 32k splits which completely underutilises the SMs since they process sequentially within a chunk (to my knowledge and perf numbers seem to say the same). I tried setting a few different ones and landed on max(32, min( 128, max_seq_len / 1500)) expression. If its too high short context suffers a bit due to overhead.

docker exec glm sed -i 's/num_kv_splits = 1 if vllm_is_batch_invariant() else 4/# Dynamic splits: ~1.5K tokens per split, clamped to [32, 128]\n        max_seq_len = int(attn_metadata.decode.seq_lens.max().item())\n        num_kv_splits = 1 if vllm_is_batch_invariant() else max(32, min(128, max_seq_len \/\/ 1500))/' /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py

then just use docker exec it glm bash and run the command or use tmux first or whatever you like best. You can use both AWQ and NVFP4. AWQ is a bit faster and smaller I think.

vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
--gpu-memory-utilization 0.85 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--max-model-len 202752 \
--max-num-batched-tokens 4096 \
--max-num-seqs 64

The results below are actually from when I used flat out 64 as num_kv_splits, if you use the dynamic one then small context is even a bit faster, like 43~ ish

model	test	t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
glm-4.7-flash	pp2048	6544.95 ± 62.05	360.34 ± 2.98	312.94 ± 2.98	360.40 ± 2.97
glm-4.7-flash	tg32	40.98 ± 0.09
glm-4.7-flash	ctx_pp @ d4096	6427.55 ± 5.34	684.66 ± 0.53	637.26 ± 0.53	684.72 ± 0.53
glm-4.7-flash	ctx_tg @ d4096	39.12 ± 0.06
glm-4.7-flash	pp2048 @ d4096	4181.95 ± 154.95	537.82 ± 18.62	490.41 ± 18.62	537.87 ± 18.62
glm-4.7-flash	tg32 @ d4096	37.63 ± 0.09
glm-4.7-flash	ctx_pp @ d8192	5277.83 ± 16.01	1599.57 ± 4.72	1552.17 ± 4.72	1599.62 ± 4.72
glm-4.7-flash	ctx_tg @ d8192	36.15 ± 0.05
glm-4.7-flash	pp2048 @ d8192	3194.30 ± 15.17	688.56 ± 3.05	641.16 ± 3.05	688.61 ± 3.04
glm-4.7-flash	tg32 @ d8192	34.85 ± 0.05
glm-4.7-flash	ctx_pp @ d16384	3813.59 ± 224.63	4359.22 ± 265.01	4311.82 ± 265.01	4359.27 ± 264.99
glm-4.7-flash	ctx_tg @ d16384	33.00 ± 0.16
glm-4.7-flash	pp2048 @ d16384	1908.54 ± 368.68	1168.95 ± 250.93	1121.55 ± 250.93	1168.99 ± 250.93
glm-4.7-flash	tg32 @ d16384	33.27 ± 0.22
glm-4.7-flash	ctx_pp @ d32768	2604.91 ± 45.24	12630.57 ± 221.22	12583.17 ± 221.22	12630.62 ± 221.21
glm-4.7-flash	ctx_tg @ d32768	31.72 ± 0.20
glm-4.7-flash	pp2048 @ d32768	1168.51 ± 147.47	1831.26 ± 247.16	1783.86 ± 247.16	1831.31 ± 247.15
glm-4.7-flash	tg32 @ d32768	31.39 ± 0.16
glm-4.7-flash	ctx_pp @ d65535	1559.22 ± 8.77	42079.41 ± 236.69	42032.01 ± 236.69	42079.45 ± 236.67
glm-4.7-flash	ctx_tg @ d65535	25.26 ± 0.06
glm-4.7-flash	pp2048 @ d65535	656.00 ± 54.39	3192.31 ± 277.00	3144.91 ± 277.00	3192.36 ± 276.99
glm-4.7-flash	tg32 @ d65535	25.13 ± 0.03
glm-4.7-flash	ctx_pp @ d100000	1081.93 ± 2.66	92475.75 ± 227.14	92428.35 ± 227.14	92475.80 ± 227.12
glm-4.7-flash	ctx_tg @ d100000	20.80 ± 0.03
glm-4.7-flash	pp2048 @ d100000	452.89 ± 22.06	4580.57 ± 228.66	4533.17 ± 228.66	4580.65 ± 228.66
glm-4.7-flash	tg32 @ d100000	20.76 ± 0.03
glm-4.7-flash	ctx_pp @ d125000	871.94 ± 1.01	143406.82 ± 165.43	143359.42 ± 165.43	143406.88 ± 165.43
glm-4.7-flash	ctx_tg @ d125000	17.94 ± 0.03
glm-4.7-flash	pp2048 @ d125000	369.72 ± 16.02	5597.45 ± 248.04	5550.05 ± 248.04	5597.51 ± 248.03
glm-4.7-flash	tg32 @ d125000	17.86 ± 0.02
glm-4.7-flash	ctx_pp @ d150000	745.40 ± 4.34	201289.13 ± 1177.88	201241.73 ± 1177.88	201289.19 ± 1177.87
glm-4.7-flash	ctx_tg @ d150000	15.85 ± 0.02
glm-4.7-flash	pp2048 @ d150000	310.65 ± 12.11	6650.27 ± 264.77	6602.87 ± 264.77	6650.33 ± 264.77
glm-4.7-flash	tg32 @ d150000	15.70 ± 0.03
glm-4.7-flash	ctx_pp @ d180000	633.11 ± 0.10	284357.49 ± 46.25	284310.08 ± 46.25	284357.55 ± 46.25
glm-4.7-flash	ctx_tg @ d180000	14.78 ± 0.02
glm-4.7-flash	pp2048 @ d180000	261.92 ± 7.81	7873.69 ± 238.38	7826.29 ± 238.38	7873.75 ± 238.39
glm-4.7-flash	tg32 @ d180000	13.91 ± 0.02
glm-4.7-flash	ctx_pp @ d195000	591.02 ± 1.88	329991.54 ± 1051.05	329944.14 ± 1051.05	329991.60 ± 1051.06
glm-4.7-flash	ctx_tg @ d195000	13.63 ± 0.00
glm-4.7-flash	pp2048 @ d195000	216.90 ± 0.54	9489.54 ± 23.37	9442.14 ± 23.37	9489.59 ± 23.38
glm-4.7-flash	tg32 @ d195000	13.50 ± 0.01

christopher_owen · January 30, 2026, 12:38am

welcome! I’ll be following along closely!

eugr · January 30, 2026, 2:18am

This is nice!

I’ve implemented this patch as a mod in GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks.

To run, pull from the repository first.

To use the mod, first build the container with Transformers 5 support (--pre-tf) flag, e.g.:

./build-and-copy.sh -t vllm-node-tf5 --use-wheels --pre-tf -c

Drop --use-wheels if you experience an error during build (see the annoucement in the Quick Start section).

Then, to run on a single node:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7

To run on cluster:

./launch-cluster.sh -t vllm-node-tf5 \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2

NOTE: vLLM implementation is suboptimal even with the patch. The model performance is still significantly slower than it should be for the model with this number of active parameters. Running in the cluster increases prompt processing performance, but not token generation. You can expect ~40 t/s generation speed in both single node and cluster.

mjpansa · January 30, 2026, 9:17am

thats, nice. I’ll need to take a better look at what community projects are out there and how to best integrate current and upcoming fixes.

Your right, lots of headroom still. The main problem is proper kernel support for SM121, the most unfortunate one is the bug in cutlass that prevents using nvpf4, so instead of doing native fp4 computation it uses Marlin for example to covert to bf16 and do calculations. KV cache is also in bf16 so loading times get quite high for long context.

Ill be playing around over the weekend seeing how difficult it is to move those over to custom triton kernels. Ive never used triton before and only played around with cuda a little bit. Gonna be a fun exercise. If we had a fused triton kernel for fp8 MLA that should already speed things up a lot for long context. Short context would need custom MOE kernels that are faster than current int4->bf16 calculations we do in Marlin. But sounds more complex. Will see

eugr · January 30, 2026, 5:15pm

Have you had a look at Christopher’s work on MXFP4 optimizations?
Also, there is this PR: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub

DannyTup · January 30, 2026, 5:35pm

I’m curious about this value. I was asking Gemini to to explain your flags to me, and it bawked at max-num-batched-tokens telling me I should consider having it match max-model-len(!) because “On Blackwell, you want the prefill to be fast. Matching this to your max model length ensures that long documents are processed in as few chunks as possible.”. The vllm docs didn’t make things any clearer to me. I trust your numbers more than I just Geminis, but I’m curious if you have thoughts on why it’s suggesting something so wildly different here.

eugr · January 30, 2026, 5:43pm

This wasn’t my recommendation, I just used the one from OP, but generally you don’t want to have it too high as well, definitely not up to max-model-len. It just consumes more memory and slows down shorter requests. Generally, unless your workflow is sending high number of requests/large requests, you want to keep batch size below 8192. It is model dependent, but for most models, 2048-4096 is the sweet spot.

I usually don’t even set it, just leave it default.

DannyTup · January 30, 2026, 5:56pm

Got it, thank you for the explanation! :)

tatamiso · January 31, 2026, 12:08am

Hi,

I tried this and it finished

./build-and-copy.sh -t vllm-node-tf5 --use-wheels --pre-tf

but when I run this:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  exec vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 30000 \
  --gpu-memory-utilization 0.7

I get this:

Auto-detecting interfaces...
Error: No active IB interfaces found.

What might it be? I only have one HP Spark variant.

eugr · January 31, 2026, 12:28am

Ah, thanks, I need to skip interface detection altogether if --solo switch is used. I’ll publish a fix shortly.

eugr · January 31, 2026, 12:40am

Fixed, could you pull the changes and try again, please? No need to rebuild the container itself, I only changed the launch script.

tatamiso · January 31, 2026, 12:48am

That was fast! Thank you!

Script runs and it’s pulling the model now. Will update this when it’s running.

Edit: success!

(APIServer pid=50) INFO:     Started server process [50]
(APIServer pid=50) INFO:     Waiting for application startup.
(APIServer pid=50) INFO:     Application startup complete.
(APIServer pid=50) INFO:     127.0.0.1:51966 - "GET /v1/models HTTP/1.1" 200 OK

DannyTup · January 31, 2026, 11:57am

@eugr sorry if this is a silly question, but how do I start this detached so it’s just available in the BG? I tried changing exec to start and adding -d for DAEMON_MODE after reading the launch_cluster.sh script, however when I run:

./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-glm-4.7-flash-AWQ \
  -d start vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 202752 \
  --max-num-seqs 64 \
  --host 0.0.0.0 --port 8888 \
  --gpu-memory-utilization 0.7

It’s still running in my terminal (I’m not detached from it). If I hit Ctrl+C then it shuts down. I was aiming for something equivalent to docker run -d.

I added echo "Daemon mode?: $DAEMON_MODE" where it prints the modes it’s running in, and that’s definitely set to true.

Thank you!

DannyTup · January 31, 2026, 12:04pm

Oh, I found the issue… The “vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit” causes it to fall into this code which forces ACTION=exec. I’ll just remove this for now, but is this a bug?

        *) 
            # If it's not a flag and not a known action, treat as exec command for backward compatibility
            # unless it's the default 'start' implied.
            # However, to support "omitted" = start, we need to be careful.
            # If the arg looks like a command, it's exec.
            ACTION="exec"
            COMMAND_TO_RUN="$@"
            break 
            ;;

Edit: Removing Action="exec" did start the container in the background, but it doesn’t seem to actually start up. Last line of the log is “Waiting for mod application…
Mod applied, starting container…” but nothing else ever happens.

eugr · January 31, 2026, 6:11pm

If you use start, you can’t usevllm serve ..., otherwise it treats it as exec. To start vllm, you will need to run docker exec ... separately. Exec always runs in interactive mode (so you could kill the cluster by pressing ctrl-c).

So you need to start the container first:

./launch-cluster.sh -t vllm-node-20260129-tf5 --solo   --apply-mod mods/fix-glm-4.7-flash-AWQ   -d start

Then start vllm (you can ctrl-c and it will not stop vllm or container, but you will lose startup logs):

docker exec vllm_node  bash -i -c "vllm serve cyankiwi/GLM-4.7-Flash-AWQ-4bit   --tool-call-parser glm47   --reasoning-parser glm45   --enable-auto-tool-choice   --served-model-name glm-4.7-flash   --max-model-len 202752   --max-num-seqs 64   --host 0.0.0.0 --port 8888   --gpu-memory-utilization 0.7"

To stop:

./launch-cluster.sh -t vllm-node-20260129-tf5 --solo  stop

eugr · January 31, 2026, 6:12pm

I just run everything inside a tmux session. This way I can detach and disconnect and not lose my terminal sessions. For production use, I start my vllm containers via llama-swap.

DannyTup · January 31, 2026, 6:37pm

Ah, got it - that worked, thanks :)

I added > /proc/1/fd/1 2> /proc/1/fd/2 to the end of the command going to docker exec, so now the output shows up in docker logs.

I’m still a bit of a tmux noob 😄 though I like to just have containers for everything and stop/start them (I have a list and buttons on my dashboard). If this model works well, I’ll probably just move what’s in your scripts inside the container so I can just docker run -d it to create it, then docker stop and docker start (via the dashboard) when I want to use it.

Thanks for all your work on this - I had lost many hours trying to get this model to work (with Gemini’s help) previously without much luck 🙃

mjpansa · February 5, 2026, 12:28pm

I am still working on this but using some time to relearn some forgotten pytorch skills / and learning triton to be able to write those missing kernels myself. But its the perfect motivation to invest some time there :)

claytantor · March 25, 2026, 3:57pm

I am running this model using ollama on my (single) ASUS GX10 and trying to configure the model file so that it runs with Claude Code. Is this an issue? I see everyone here using docker containers and serving the model up directly. I use ollama because I am experimenting with model performance first.

Currently my model file is:

FROM glm-4.7-flash:q4_K_M

PARAMETER num_ctx        32768
PARAMETER num_batch      2048
PARAMETER num_gpu        999
PARAMETER num_thread     16
PARAMETER use_mmap       false

PARAMETER temperature    0.2
PARAMETER top_p          0.90
PARAMETER top_k          10
PARAMETER repeat_penalty 1.05
PARAMETER min_p          0.01

RENDERER glm-4.7
PARSER glm-4.7

SYSTEM “”"
You are an expert software engineer operating as an autonomous coding agent.

CRITICAL BEHAVIOR RULES:

When given a task, execute it immediately and completely. Do not ask for confirmation.

Never ask “would you like me to proceed?” — just proceed.

Never ask “what would you like me to do?” — the user already told you.

If a task covers multiple files, work through all of them without stopping to check in.

Only ask a question if the task is genuinely ambiguous and you cannot make a reasonable default choice.

Prefer action over clarification. When in doubt, do the most reasonable thing and explain what you did.
“”"

Topic		Replies	Views
GLM-4.7-Flash-NVFP4 was just released, but for Transformers 5.0 + vLLM 0.14...? DGX Spark / GB10	89	4785	February 13, 2026
How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker DGX Spark / GB10	27	4496	January 2, 2026
Running GLM-4.7-FP8 (355B MoE) on 4x DGX Spark with SGLang + EAGLE Speculative Decoding DGX Spark / GB10 Projects	38	2509	June 24, 2026
vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing? DGX Spark / GB10	143	7972	February 24, 2026
Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node) DGX Spark / GB10 mistral-large	18	2767	December 25, 2025
[Request] GLM-4.7-Flash AWQ/NVFP4 Instructions DGX Spark / GB10 Projects	7	1206	January 26, 2026
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	71	7038	June 15, 2026
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers DGX Spark / GB10 deepseek	260	22224	July 15, 2026
DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps DGX Spark / GB10 Projects deepseek	3	749	June 19, 2026
Step-3.7-Flash on single Spark (llama.cpp only) DGX Spark / GB10 Projects llama	17	1945	June 23, 2026

Make GLM-4.7-Flash go BRRRRR

Related topics