How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker

GLM 4.7 was released yesterday, and you can already run it on a dual Spark cluster.

Currently, there is only one quant suitable for this setup: Salyut1/GLM-4.7-NVFP4 on Hugging Face.

The problem is that, due to the way it’s quantized, it uses a model config that is not compatible with the existing GLM 4.x parser, so if you just try to run it, it will fail to load.

Fortunately, the author posted a way to patch vLLM by skipping some checks the parser performs.

Because this patch disables checks that may be important for other models, I decided not to bake it into the Docker build process itself, but to introduce the concept of mods: small patches that are applied at launch.

If you don’t want to wait for AWQ quants and want to try this model, pull the latest changes from the eugr/spark-vllm-docker repository on GitHub (Docker configuration for running vLLM on dual DGX Sparks), download the model on both nodes, rebuild the container, and run the following command on the head node:

./launch-cluster.sh --apply-mod ./mods/fix-Salyut1-GLM-4.7-NVFP4 \
exec vllm serve Salyut1/GLM-4.7-NVFP4 \
        --attention-config.backend flashinfer \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000
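
Once the cluster is up, you can quickly confirm that the model is being served by hitting vLLM's OpenAI-compatible API from any machine that can reach the head node. A minimal sketch in Python (spark-head is a placeholder for your head node's hostname or IP):

# List the models the server has loaded.
# "spark-head" is a placeholder - substitute your head node's hostname or IP.
import json
import urllib.request

with urllib.request.urlopen("http://spark-head:8000/v1/models") as resp:
    models = json.load(resp)

for model in models["data"]:
    print(model["id"])  # should include Salyut1/GLM-4.7-NVFP4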

UPDATE 12/24: You can now use the hf-download.sh script to download and distribute the model across the cluster nodes over the ConnectX-7 interconnect. Run on the head node:

./hf-download.sh Salyut1/GLM-4.7-NVFP4 -c

It will auto-discover the nodes. If you have more than two nodes, you can add --copy-parallel to saturate the link. For more options, refer to the documentation in the repository.

The script requires uvx to be installed on the host system.

6 Likes

You can skip rebuilding the container if your latest build was from nightly wheels or from the main branch source on December 21 or later.

2 Likes

FYI - the first AWQ quant has just popped up: cyankiwi/GLM-4.7-AWQ-4bit
Downloading it now - will compare.

Merry Christmas! :)

1 Like

Don’t use cyankiwi/GLM-4.7-AWQ-4bit - it produces random garbage as output. So far, the NVFP4 model in my original post is the only one that works on dual Sparks (at least in vLLM).

1 Like

Thanks eugr for updating the community!

1 Like

There is a new one from a very reputable quant provider: QuantTrio/GLM-4.7-AWQ on Hugging Face.

This seems to also support MTP, and is smaller in size - I’ll test it later.

OK, so the QuantTrio quant works very well and gives the same performance as their GLM 4.6 quant - I’m getting 16 t/s. I also tried MTP, but while the benchmarks showed some performance boost, it felt choppy, with speed-ups and slowdowns.

To run:

Pull the latest version of the eugr/spark-vllm-docker repository on GitHub (Docker configuration for running vLLM on dual DGX Sparks) if you are using it.

Then download the model on the cluster nodes using the new download script:

./hf-download.sh QuantTrio/GLM-4.7-AWQ -c --copy-parallel

Run the model from the head node:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 65535 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

To use MTP, you can run:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 50000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1

Some benchmarks:

Without MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888
vllm bench serve --backend vllm --model QuantTrio/GLM-4.7-AWQ --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --port 8888 --host spark
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  7.80
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.13
Output token throughput (tok/s):         15.25
Peak output token throughput (tok/s):    16.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          16.79
---------------Time to First Token----------------
Mean TTFT (ms):                          249.52
Median TTFT (ms):                        249.52
P99 TTFT (ms):                           249.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          64.02
Median TPOT (ms):                        64.02
P99 TPOT (ms):                           64.02
---------------Inter-token Latency----------------
Mean ITL (ms):                           64.02
Median ITL (ms):                         62.22
P99 ITL (ms):                            75.49
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  88.66
Total input tokens:                      1371
Total generated tokens:                  2453
Request throughput (req/s):              0.11
Output token throughput (tok/s):         27.67
Peak output token throughput (tok/s):    49.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          43.13
---------------Time to First Token----------------
Mean TTFT (ms):                          2765.16
Median TTFT (ms):                        3035.13
P99 TTFT (ms):                           3036.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          189.91
Median TPOT (ms):                        171.51
P99 TPOT (ms):                           413.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           132.04
Median ITL (ms):                         121.88
P99 ITL (ms):                            217.29
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  252.77
Total input tokens:                      22992
Total generated tokens:                  20942
Request throughput (req/s):              0.40
Output token throughput (tok/s):         82.85
Peak output token throughput (tok/s):    222.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          173.81
---------------Time to First Token----------------
Mean TTFT (ms):                          9816.27
Median TTFT (ms):                        10136.62
P99 TTFT (ms):                           18330.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          578.08
Median TPOT (ms):                        480.05
P99 TPOT (ms):                           1679.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           421.64
Median ITL (ms):                         417.24
P99 ITL (ms):                            1672.10
==================================================
With MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  5.62
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.18
Output token throughput (tok/s):         21.17
Peak output token throughput (tok/s):    12.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          23.31
---------------Time to First Token----------------
Mean TTFT (ms):                          249.01
Median TTFT (ms):                        249.01
P99 TTFT (ms):                           249.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.52
Median TPOT (ms):                        45.52
P99 TPOT (ms):                           45.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.92
Median ITL (ms):                         83.74
P99 ITL (ms):                            97.95
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  67.54
Total input tokens:                      1371
Total generated tokens:                  2526
Request throughput (req/s):              0.15
Output token throughput (tok/s):         37.40
Peak output token throughput (tok/s):    35.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          57.70
---------------Time to First Token----------------
Mean TTFT (ms):                          2335.06
Median TTFT (ms):                        2563.09
P99 TTFT (ms):                           2564.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          116.90
Median TPOT (ms):                        127.24
P99 TPOT (ms):                           148.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           189.11
Median ITL (ms):                         180.69
P99 ITL (ms):                            268.93
==================================================
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  334.69
Total input tokens:                      22992
Total generated tokens:                  3757
Request throughput (req/s):              0.30
Output token throughput (tok/s):         11.23
Peak output token throughput (tok/s):    157.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          79.92
---------------Time to First Token----------------
Mean TTFT (ms):                          11175.71
Median TTFT (ms):                        11340.99
P99 TTFT (ms):                           19969.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          524.87
Median TPOT (ms):                        476.19
P99 TPOT (ms):                           1255.18
---------------Inter-token Latency----------------
Mean ITL (ms):                           848.31
Median ITL (ms):                         598.62
P99 ITL (ms):                            1746.27
==================================================

Server crashed after serving 100 requests.

4 Likes

Hi @eugr, I would like to pay you a compliment on this and your efforts. The setup works great on our two DGX Sparks; see, for example, the non-MTP performance below (a simulated “real-life” test based on feeding the model lots of Python scripts and letting it analyze them).

## 📊 Benchmark results summary

**Model:** `/srv/models/models/GLM-4.7-AWQ`

**Timestamp:** `2025-12-28 12:20:18`

| Input Tokens (target) | Input Tokens (actual) | Output Tokens | TTFT (s) | Gen Time (s) | Total Time (s) | Gen Speed (tok/s) | Overall (tok/s) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 10 | 66 | 1,024 | 0.530 | 65.11 | 65.64 | 15.7 | 15.6 |
| 100 | 158 | 1,024 | 0.627 | 64.92 | 65.55 | 15.8 | 15.6 |
| 1,000 | 1,063 | 1,024 | 1.654 | 65.86 | 67.52 | 15.5 | 15.2 |
| 5,000 | 5,059 | 1,023 | 6.507 | 69.74 | 76.25 | 14.7 | 13.4 |
| 20,000 | 20,065 | 1,021 | 24.227 | 82.02 | 106.25 | 12.4 | 9.6 |
| 50,000 | 50,064 | 1,024 | 81.901 | 107.21 | 189.11 | 9.6 | 5.4 |

## 📈 Summary statistics

- **Avg TTFT:** 19.241s
- **Avg Generation Speed:** 14.0 tok/s
- **Min/Max TTFT:** 0.530s / 81.901s
- **Min/Max Gen Speed:** 9.6 / 15.8 tok/s
- **Total output tokens:** 6,140
- **Successful runs:** 6/6

2 Likes

Thanks, this is great! What did you use for your benchmarks?

BTW, if your client software can’t reliably handle thinking tags from GLM 4.7 (or 4.6) models, which is probably the case with most clients, try using deepseek_r1 as the reasoning parser, so the launch command would be:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 65535 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000
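
With a reasoning parser enabled, vLLM returns the model's thinking separately from the final answer in the chat completion response, so the client doesn't have to strip the tags itself. A minimal sketch of what that looks like on the wire (spark-head is again a placeholder for the head node's address):

# Send one chat request and show how the reasoning parser splits the output.
# "spark-head" is a placeholder - substitute your head node's hostname or IP.
import json
import urllib.request

payload = {
    "model": "QuantTrio/GLM-4.7-AWQ",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
}
request = urllib.request.Request(
    "http://spark-head:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    message = json.load(resp)["choices"][0]["message"]

print("reasoning_content:", message.get("reasoning_content"))  # the thinking trace
print("content:          ", message["content"])                # the final answer only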

Also, MTP seems to significantly boost performance in coding workflows, so I recommend enabling it.

2 Likes

Here is my working configuration with MTP and a 128K context window (you can actually fit up to 140K if you drop caches), but you need to use an fp8 quant for the KV cache:

./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser deepseek_r1 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 128000 \
        --kv-cache-dtype fp8 \
        --speculative-config.method mtp \
        --speculative-config.num_speculative_tokens 1 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

There’s a fix that just got merged into vLLM for the GLM-4.7 tool-call parser; it was broken before.

1 Like

Yep, it fixed the tool-calling issues, except for one: when there is a tool call without any parameters.

That’s interesting. I have a DGX Spark and am considering buying a 2nd one.

  • A single Spark allegedly has 273GB/s of memory bandwidth (per the spec sheet). In practice, in a local benchmark, I observed 110GB/s while copying a 4GB tensor.
  • Adding a 2nd Spark via a 200Gbps QSFP cable gives you just under 25GB/s between the two.

I’m curious what the “penalty hit” is for distributing like this relative to a theoretical Spark with 256GB VRAM. Any chance you could benchmark a MOE model that fits entirely on 1 Spark (something around 90GB), vs distributing it across 2 Sparks?
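
For anyone who wants to reproduce that kind of copy-bandwidth number, here is a rough sketch (illustrative only, not the exact code I used; it assumes PyTorch with CUDA on the Spark):

# Rough device-to-device copy bandwidth check (illustrative sketch; assumes
# PyTorch with CUDA is available on the Spark).
import torch

n_bytes = 4 * 1024**3                          # a 4 GB tensor
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
dst.copy_(src)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000       # elapsed_time() returns milliseconds
# Note: a copy reads and writes, so total memory traffic is roughly 2x the tensor size.
print(f"~{n_bytes / seconds / 1e9:.1f} GB/s (tensor size / copy time)")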

Latency plays a bigger role than interconnect speed here. I get almost 2x inference speed for large dense models, but the smaller the number of active parameters, the smaller the gain. Here is a table I compiled about a month ago:

| Model name | Cluster (t/s) | Single (t/s) | Comment |
|---|---:|---:|---|
| Qwen/Qwen3-VL-32B-Instruct-FP8 | 12.00 | 7.00 | |
| cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit | 21.00 | 12.00 | |
| GPT-OSS-120B | 55.00 | 36.00 | SGLang gives 75/53 |
| RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 | 21.00 | N/A | |
| QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ | 26.00 | N/A | |
| Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 | 65.00 | 52.00 | |
| QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ | 97.00 | 82.00 | |
| RedHatAI/Qwen3-30B-A3B-NVFP4 | 75.00 | 64.00 | |
| QuantTrio/MiniMax-M2-AWQ | 41.00 | N/A | |
| QuantTrio/GLM-4.6-AWQ | 17.00 | N/A | |
| zai-org/GLM-4.6V-FP8 | 24.00 | N/A | |

This is all using vLLM (and additionally SGLang for gpt-oss-120b) with tensor-parallel=2 over NCCL/RDMA.

2 Likes

Thanks for the benchmarks, very informative!

I would have been curious to see you go bigger and max out a single Spark with your model choices. A typical home user is basically always trying to fill 99% of the VRAM on their device with the biggest model possible, so they’re not going to be running Qwen3-30B-A3B. Of those, GPT-OSS-120B would be the best choice for such a user, but it kinda messes up benchmarks by being natively FP4.

If you have these models lying around quantized enough to fit on 1 Spark, and don’t mind running a test, I would be curious to know how they fare in 1 vs 2 Spark:

  1. GLM-4.5-Air (106B A12B)
  2. A large dense model like Llama3 70B. I know dense models are out of fashion in 2025 (except for ByteDance’s Seed OSS 36B), but that would HAVE to run slower on 2 Sparks, surely?

@eugr How tenable is GLM 4.7 at general coding? Do you think it can replace Claude Sonnet 4.5?

Normally, yes, especially with MoE, but sometimes you just need more speed. For some workloads I run multiple models on the same Spark: an embedding model, a fast multi-modal model, and a “thinking” model.

Besides being able to use larger models, two Sparks are the way for me to speed up models that otherwise fit on a single Spark but are still large enough to overcome the latency bottleneck.

I have an AWQ quant of this, can run benchmarks later today.

Not really. If anything, a larger dense model will benefit the most from running on two Sparks. I don’t have a 70B model handy, but you can look at the Qwen3-32B-FP8 results in my table above. At FP8 it takes about the same amount of memory as a 64B-parameter dense model at a 4-bit quant.

Basically, the way tensor parallelism works is that it splits the weights between workers so layers can be processed in parallel, then performs an all_reduce operation to aggregate the partial results and update the workers. This takes very little bandwidth, but the entire cluster has to finish it before processing the next layer, so latency is very important here.

In a small model, or an MoE model with a relatively small number of active parameters, the matrix ops on the partial weights are much quicker, so the all_reduce operation becomes a bottleneck that slows down overall inference: the smaller the model, the bigger the penalty.

The larger the model, the more time it takes to process a single layer, so all_reduce becomes less of a bottleneck. Basically, the larger the model, the closer you get to a 2x gain from using two Sparks.
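
To put rough numbers on that, here is a back-of-the-envelope sketch (the hidden size, layer count, and per-call latency below are illustrative assumptions, not GLM-4.7's actual figures):

# Back-of-the-envelope for TP=2 decode: the all_reduce payload per token is
# tiny, but the fixed per-call latency adds up across layers.
# Every number below is an assumption for illustration only.
hidden_size = 6144           # assumed hidden dimension
num_layers = 90              # assumed transformer layer count
allreduces_per_layer = 2     # typically one after attention, one after the MLP/MoE block
per_call_latency_us = 50     # assumed fixed cost of one all_reduce over the interconnect

payload_bytes = hidden_size * 2                       # bf16 activations, batch=1, one token
calls_per_token = num_layers * allreduces_per_layer
latency_ms_per_token = calls_per_token * per_call_latency_us / 1000

print(f"payload per all_reduce: {payload_bytes / 1024:.0f} KB (bandwidth is not the issue)")
print(f"fixed latency per token: ~{latency_ms_per_token:.0f} ms across {calls_per_token} all_reduce calls")

With numbers like these, each all_reduce only moves about 12 KB, but the fixed latency alone adds roughly 9 ms per generated token - negligible for a heavy dense model, dominant for a small or sparse one.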

1 Like

In my limited testing (mostly with Python code), it’s pretty close to it. Better than MiniMax M2, not sure about MiniMax M2.1 yet.

2 Likes