How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker

GLM 4.7 was released yesterday, and you can already run it on a dual Spark cluster.

Currently, there is only one quant suitable for this setup: Salyut1/GLM-4.7-NVFP4 on Hugging Face.

The problem is that, due to the way it’s quantized, it uses a model config that is not compatible with the existing GLM 4.x parser, so if you just try to run it, it will fail to load.

Fortunately, the author posted a way to patch vLLM by skipping some checks the parser performs.

Because this patch disables checks that may be important for other models, I decided not to bake it into the Docker build process itself, but to introduce the concept of mods: small patches that are applied at launch.

If you don’t want to wait for AWQ quants and want to try this model, pull the latest changes from the eugr/spark-vllm-docker repository on GitHub (Docker configuration for running vLLM on dual DGX Sparks), download the model on both nodes, rebuild the container, and run the following command on the head node:

./launch-cluster.sh --apply-mod ./mods/fix-Salyut1-GLM-4.7-NVFP4 \
exec vllm serve Salyut1/GLM-4.7-NVFP4 \
        --attention-config.backend flashinfer \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000
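
Once the cluster is up, you can quickly confirm that the model is being served by hitting vLLM's OpenAI-compatible API from any machine that can reach the head node. A minimal sketch in Python (spark-head is a placeholder for your head node's hostname or IP):

# List the models the server has loaded.
# "spark-head" is a placeholder - substitute your head node's hostname or IP.
import json
import urllib.request

with urllib.request.urlopen("http://spark-head:8000/v1/models") as resp:
    models = json.load(resp)

for model in models["data"]:
    print(model["id"])  # should include Salyut1/GLM-4.7-NVFP4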

UPDATE 12/24: You can now use the hf-download.sh script to download and distribute the model across the cluster nodes over the ConnectX-7 interconnect. Run on the head node:

./hf-download.sh Salyut1/GLM-4.7-NVFP4 -c

It will auto-discover the nodes. If you have more than two nodes, you can add --copy-parallel to saturate the link. For more options, refer to the documentation in the repository.

The script requires uvx to be installed on the host system.

6 Likes

You can skip rebuilding the container if your latest build was from nightly wheels or from the main branch source on December 21 or later.

2 Likes

FYI - the first AWQ quant has just popped up: cyankiwi/GLM-4.7-AWQ-4bit
Downloading it now - will compare.

Merry Christmas! :)

1 Like

Don’t use cyankiwi/GLM-4.7-AWQ-4bit - it produces random garbage as output. So far, the NVFP4 model in my original post is the only one that works on dual Sparks (at least in vLLM).

1 Like

Thanks eugr for updating the community!

1 Like

There is a new one from a very reputable quant provider: QuantTrio/GLM-4.7-AWQ on Hugging Face.

This seems to also support MTP, and is smaller in size - I’ll test it later.

OK, so the QuantTrio quant works very well and gives the same performance as their GLM 4.6 quant - I’m getting 16 t/s. I also tried MTP, but while the benchmarks showed some performance boost, it felt choppy, with speed-ups and slowdowns.

To run:

Pull the latest version of the eugr/spark-vllm-docker repository on GitHub (Docker configuration for running vLLM on dual DGX Sparks) if you are using it.

Then download the model on the cluster nodes using the new download script:

./hf-download.sh QuantTrio/GLM-4.7-AWQ -c --copy-parallel

Run the model from the head node:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 65535 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

To use MTP, you can run:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 50000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1

Some benchmarks:

Without MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888
vllm bench serve --backend vllm --model QuantTrio/GLM-4.7-AWQ --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --port 8888 --host spark
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  7.80
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.13
Output token throughput (tok/s):         15.25
Peak output token throughput (tok/s):    16.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          16.79
---------------Time to First Token----------------
Mean TTFT (ms):                          249.52
Median TTFT (ms):                        249.52
P99 TTFT (ms):                           249.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.02
Median TPOT (ms):                        64.02
P99 TPOT (ms):                           64.02
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.02
Median ITL (ms):                         62.22
P99 ITL (ms):                            75.49
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  88.66
Total input tokens:                      1371
Total generated tokens:                  2453
Request throughput (req/s):              0.11
Output token throughput (tok/s):         27.67
Peak output token throughput (tok/s):    49.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          43.13
---------------Time to First Token----------------
Mean TTFT (ms):                          2765.16
Median TTFT (ms):                        3035.13
P99 TTFT (ms):                           3036.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          189.91
Median TPOT (ms):                        171.51
P99 TPOT (ms):                           413.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           132.04
Median ITL (ms):                         121.88
P99 ITL (ms):                            217.29
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  252.77
Total input tokens:                      22992
Total generated tokens:                  20942
Request throughput (req/s):              0.40
Output token throughput (tok/s):         82.85
Peak output token throughput (tok/s):    222.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          173.81
---------------Time to First Token----------------
Mean TTFT (ms):                          9816.27
Median TTFT (ms):                        10136.62
P99 TTFT (ms):                           18330.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          578.08
Median TPOT (ms):                        480.05
P99 TPOT (ms):                           1679.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           421.64
Median ITL (ms):                         417.24
P99 ITL (ms):                            1672.10
==================================================
With MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  5.62
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.18
Output token throughput (tok/s):         21.17
Peak output token throughput (tok/s):    12.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          23.31
---------------Time to First Token----------------
Mean TTFT (ms):                          249.01
Median TTFT (ms):                        249.01
P99 TTFT (ms):                           249.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.52
Median TPOT (ms):                        45.52
P99 TPOT (ms):                           45.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.92
Median ITL (ms):                         83.74
P99 ITL (ms):                            97.95
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  67.54
Total input tokens:                      1371
Total generated tokens:                  2526
Request throughput (req/s):              0.15
Output token throughput (tok/s):         37.40
Peak output token throughput (tok/s):    35.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          57.70
---------------Time to First Token----------------
Mean TTFT (ms):                          2335.06
Median TTFT (ms):                        2563.09
P99 TTFT (ms):                           2564.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          116.90
Median TPOT (ms):                        127.24
P99 TPOT (ms):                           148.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           189.11
Median ITL (ms):                         180.69
P99 ITL (ms):                            268.93
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  334.69
Total input tokens:                      22992
Total generated tokens:                  3757
Request throughput (req/s):              0.30
Output token throughput (tok/s):         11.23
Peak output token throughput (tok/s):    157.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          79.92
---------------Time to First Token----------------
Mean TTFT (ms):                          11175.71
Median TTFT (ms):                        11340.99
P99 TTFT (ms):                           19969.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          524.87
Median TPOT (ms):                        476.19
P99 TPOT (ms):                           1255.18
---------------Inter-token Latency----------------
Mean ITL (ms):                           848.31
Median ITL (ms):                         598.62
P99 ITL (ms):                            1746.27
==================================================

Server crashed after serving 100 requests.

4 Likes

Hi @eugr, I would like to pay you a compliment on this and your efforts. The setup works great on our two DGX Sparks; see, for example, the non-MTP performance below (a simulated “real-life” test based on feeding the model lots of Python scripts and letting it analyze them).

## 📊 Benchmark results summary

**Model:** `/srv/models/models/GLM-4.7-AWQ`

**Timestamp:** `2025-12-28 12:20:18`

| Input Tokens (target) | Input Tokens (actual) | Output Tokens | TTFT (s) | Gen Time (s) | Total Time (s) | Gen Speed (tok/s) | Overall (tok/s) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 10 | 66 | 1,024 | 0.530 | 65.11 | 65.64 | 15.7 | 15.6 |
| 100 | 158 | 1,024 | 0.627 | 64.92 | 65.55 | 15.8 | 15.6 |
| 1,000 | 1,063 | 1,024 | 1.654 | 65.86 | 67.52 | 15.5 | 15.2 |
| 5,000 | 5,059 | 1,023 | 6.507 | 69.74 | 76.25 | 14.7 | 13.4 |
| 20,000 | 20,065 | 1,021 | 24.227 | 82.02 | 106.25 | 12.4 | 9.6 |
| 50,000 | 50,064 | 1,024 | 81.901 | 107.21 | 189.11 | 9.6 | 5.4 |

## 📈 Summary statistics

- **Avg TTFT:** 19.241s
- **Avg Generation Speed:** 14.0 tok/s
- **Min/Max TTFT:** 0.530s / 81.901s
- **Min/Max Gen Speed:** 9.6 / 15.8 tok/s
- **Total output tokens:** 6,140
- **Successful runs:** 6/6

2 Likes

Thanks, this is great! What did you use for your benchmarks?

BTW, if your client software can’t reliably handle thinking tags from GLM 4.7 (or 4.6) models, which is probably the case with most clients, try using deepseek_r1 as the reasoning parser, so the launch command would be:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 65535 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000
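
With a reasoning parser enabled, vLLM returns the model's thinking separately from the final answer in the chat completion response, so the client doesn't have to strip the tags itself. A minimal sketch of what that looks like on the wire (spark-head is again a placeholder for the head node's address):

# Send one chat request and show how the reasoning parser splits the output.
# "spark-head" is a placeholder - substitute your head node's hostname or IP.
import json
import urllib.request

payload = {
    "model": "QuantTrio/GLM-4.7-AWQ",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
}
request = urllib.request.Request(
    "http://spark-head:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    message = json.load(resp)["choices"][0]["message"]

print("reasoning_content:", message.get("reasoning_content"))  # the thinking trace
print("content:          ", message["content"])                # the final answer only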

Also, MTP seems to significantly boost performance in coding workflows, so I recommend enabling it.

2 Likes

Here is my working configuration with MTP and a 128K context window (you can actually fit up to 140K if you drop caches), but you need to use an fp8 quant for the KV cache:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 128000 \
        --kv-cache-dtype fp8 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

There’s a fix that just got merged into vLLM for the GLM-4.7 tool-call parser; it was broken before.

1 Like

Yep, it fixed the tool-calling issues, except for one: when there is a tool call without any parameters.

That’s interesting. I have a DGX Spark and am considering buying a 2nd one.

  • A single Spark allegedly has 273GB/s of memory bandwidth (per the spec sheet). In practice, in a local benchmark, I observed 110GB/s while copying a 4GB tensor.
  • Adding a 2nd Spark via a 200Gbps QSFP cable gives you just under 25GB/s between the two.

I’m curious what the “penalty hit” is for distributing like this relative to a theoretical Spark with 256GB VRAM. Any chance you could benchmark a MOE model that fits entirely on 1 Spark (something around 90GB), vs distributing it across 2 Sparks?
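
For anyone who wants to reproduce that kind of copy-bandwidth number, here is a rough sketch (illustrative only, not the exact code I used; it assumes PyTorch with CUDA on the Spark):

# Rough device-to-device copy bandwidth check (illustrative sketch; assumes
# PyTorch with CUDA is available on the Spark).
import torch

n_bytes = 4 * 1024**3                          # a 4 GB tensor
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
dst.copy_(src)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000       # elapsed_time() returns milliseconds
# Note: a copy reads and writes, so total memory traffic is roughly 2x the tensor size.
print(f"~{n_bytes / seconds / 1e9:.1f} GB/s (tensor size / copy time)")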

Latency plays a bigger role than interconnect speed here. I get almost 2x inference speed for large dense models, but the smaller the number of active parameters, the smaller the gain. Here is a table I compiled about a month ago:

| Model name | Cluster (t/s) | Single (t/s) | Comment |
|---|---:|---:|---|
| Qwen/Qwen3-VL-32B-Instruct-FP8 | 12.00 | 7.00 | |
| cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit | 21.00 | 12.00 | |
| GPT-OSS-120B | 55.00 | 36.00 | SGLang gives 75/53 |
| RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 | 21.00 | N/A | |
| QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ | 26.00 | N/A | |
| Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | 65.00 | 52.00 | |
| QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ | 97.00 | 82.00 | |
| RedHatAI/Qwen3-30B-A3B-NVFP4 | 75.00 | 64.00 | |
| QuantTrio/MiniMax-M2-AWQ | 41.00 | N/A | |
| QuantTrio/GLM-4.6-AWQ | 17.00 | N/A | |
| zai-org/GLM-4.6V-FP8 | 24.00 | N/A | |

This is all using vLLM (and additionally SGLang for gpt-oss-120b) with tensor-parallel=2 over NCCL/RDMA.

2 Likes

Thanks for the benchmarks, very informative!

I would have been curious to see you go bigger and max out a single Spark with your model choices. A typical home user is basically always trying to fill 99% of the VRAM on their device with the biggest model possible, so they’re not going to be running Qwen3-30B-A3B. Of those, GPT-OSS-120B would be the best choice for such a user, but it kinda messes up benchmarks by being natively FP4.

If you have these models lying around quantized enough to fit on 1 Spark, and don’t mind running a test, I would be curious to know how they fare in 1 vs 2 Spark:

  1. GLM-4.5-Air (106B A12B)
  2. A large dense model like Llama3 70B. I know dense models are out of fashion in 2025 (except for ByteDance’s Seed OSS 36B), but that would HAVE to run slower on 2 Sparks, surely?

@eugr How tenable is GLM 4.7 at general coding? Do you think it can replace Claude Sonnet 4.5?

Normally, yes, especially with MoE, but sometimes you just need more speed. For some workloads I run multiple models on the same Spark: an embedding model, a fast multi-modal model, and a “thinking” model.

Besides being able to use larger models, two Sparks are the way for me to speed up models that otherwise fit on a single Spark but are still large enough to overcome the latency bottleneck.

I have an AWQ quant of this, can run benchmarks later today.

Not really. If anything, a larger dense model will benefit the most from running on two Sparks. I don’t have a 70B model handy, but you can look at the Qwen3-32B-FP8 results in my table above. At FP8 it takes about the same amount of memory as a 64B-parameter dense model at a 4-bit quant.

Basically, the way tensor parallelism works is that it splits the weights between workers so layers can be processed in parallel, then performs an all_reduce operation to aggregate the partial results and update the workers. This takes very little bandwidth, but the entire cluster has to finish it before processing the next layer, so latency is very important here.

In a small model, or an MoE model with a relatively small number of active parameters, the matrix ops on the partial weights are much quicker, so the all_reduce operation becomes a bottleneck that slows down overall inference: the smaller the model, the bigger the penalty.

The larger the model, the more time it takes to process a single layer, so all_reduce becomes less of a bottleneck. Basically, the larger the model, the closer you get to a 2x gain from using two Sparks.
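
To put rough numbers on that, here is a back-of-the-envelope sketch (the hidden size, layer count, and per-call latency below are illustrative assumptions, not GLM-4.7's actual figures):

# Back-of-the-envelope for TP=2 decode: the all_reduce payload per token is
# tiny, but the fixed per-call latency adds up across layers.
# Every number below is an assumption for illustration only.
hidden_size = 6144           # assumed hidden dimension
num_layers = 90              # assumed transformer layer count
allreduces_per_layer = 2     # typically one after attention, one after the MLP/MoE block
per_call_latency_us = 50     # assumed fixed cost of one all_reduce over the interconnect

payload_bytes = hidden_size * 2                       # bf16 activations, batch=1, one token
calls_per_token = num_layers * allreduces_per_layer
latency_ms_per_token = calls_per_token * per_call_latency_us / 1000

print(f"payload per all_reduce: {payload_bytes / 1024:.0f} KB (bandwidth is not the issue)")
print(f"fixed latency per token: ~{latency_ms_per_token:.0f} ms across {calls_per_token} all_reduce calls")

With numbers like these, each all_reduce only moves about 12 KB, but the fixed latency alone adds roughly 9 ms per generated token - negligible for a heavy dense model, dominant for a small or sparse one.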

1 Like

In my limited testing (mostly with Python code), it’s pretty close to it. Better than MiniMax M2, not sure about MiniMax M2.1 yet.

2 Likes