GLM 4.7 was released yesterday, and you can already run it on a dual Spark cluster. Only one quant suitable for this setup is currently available: Salyut1/GLM-4.7-NVFP4 · Hugging Face. The problem is that, due to the way it's quantized, it uses a model config that is not compatible …

Thanks eugr for updating the community!

You can skip rebuilding the container if your latest build was from nightly wheels or from the main branch source on December 21 or later.

FYI - the first AWQ quant has just popped up: cyankiwi/GLM-4.7-AWQ-4bit. Downloading it now - will compare.

There is a new one, from a very reputable quant provider: QuantTrio/GLM-4.7-AWQ · Hugging Face. It also seems to support MTP and is smaller in size - I'll test it later.

OK, so the QuantTrio quant works very well and gives the same performance as their GLM 4.6 quant - I'm getting 16 t/s. I also tried MTP, but while the benchmarks showed some performance boost, generation was choppy, with alternating speedups and slowdowns. To run: Pull the latest version of GitHub - eugr/spark-vllm-dock…

Hi @eugr, I would like to compliment you on this and on your efforts. The setup works great on our two DGX Sparks. See, for example, the non-MTP performance below (a simulated "real-life" test based on feeding the model lots of Python scripts and letting it analyze them). ## 📊 Benchmark results…

Thanks, this is great! What did you use for your benchmarks?

BTW, if your client software can't reliably handle the thinking tags from GLM 4.7 (or 4.6) models, which is probably the case with most clients, try using deepseek_r1 as the reasoning parser. The launch command would then be: ./launch-cluster.sh exec \ vllm serve QuantTrio/GLM-4.7-AWQ \ --tool-call…

Here is my working configuration with MTP and a 128K context window (you can actually fit up to 140K if you drop caches). Note that you need to use fp8 quantization for the KV cache: ./launch-cluster.sh exec \ vllm serve QuantTrio/GLM-4.7-AWQ \ --tool-call-parser glm47 \ --reasoning-parser deepseek_r1 \ …
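The full command in the post above is truncated, so the following is only a rough sketch of how the two settings described in the prose (128K context, fp8 KV cache) could be expressed. It assumes the standard vLLM options `--max-model-len` and `--kv-cache-dtype`; these are not confirmed to be the flags the poster actually used, and the MTP options are left out entirely because they are not visible in the excerpt. The script only prints the command for review rather than running it.

```shell
# 128K tokens written out as the integer value for --max-model-len.
MAX_MODEL_LEN=$((128 * 1024))

# Assumption: --max-model-len and --kv-cache-dtype are standard vLLM options
# that match the description above; they are NOT copied from the truncated
# original command. MTP (speculative decoding) options are deliberately omitted.
CMD="vllm serve QuantTrio/GLM-4.7-AWQ \
--tool-call-parser glm47 \
--reasoning-parser deepseek_r1 \
--max-model-len $MAX_MODEL_LEN \
--kv-cache-dtype fp8"

# Print the command instead of executing it. On the cluster it would be
# prefixed with: ./launch-cluster.sh exec
echo "$CMD"
```

Printing first makes it easy to check the computed context length (131072 tokens) before starting a long model load across both nodes.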

How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker


eugr December 24, 2025, 3:54pm 5

Don't use cyankiwi/GLM-4.7-AWQ-4bit - it produces random garbage as output. So far, the NVFP4 model in my original post is the only one that works on dual Sparks (at least in vLLM).

