OK, so the QuantTrio quant works very well and gives the same performance as their GLM 4.6 quant: I'm getting 16 t/s. I also tried MTP, but while the benchmarks showed some performance boost, generation was choppy, alternating between speed-ups and slowdowns.
To run:
Pull the latest version of eugr/spark-vllm-docker from GitHub (Docker configuration for running vLLM on dual DGX Sparks), if you are using it.
Then, download the model on cluster nodes, using the new download script:
./hf-download.sh QuantTrio/GLM-4.7-AWQ -c --copy-parallel
Run the model from the head node:
./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 65535 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000
To use MTP, you can run:
./launch-cluster.sh exec \
vllm serve QuantTrio/GLM-4.7-AWQ \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 50000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1
Some benchmarks:
Without MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.88 \
--max-model-len 32000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8888
vllm bench serve --backend vllm --model QuantTrio/GLM-4.7-AWQ --endpoint /v1/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --port 8888 --host spark
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 7.80
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.13
Output token throughput (tok/s): 15.25
Peak output token throughput (tok/s): 16.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 16.79
---------------Time to First Token----------------
Mean TTFT (ms): 249.52
Median TTFT (ms): 249.52
P99 TTFT (ms): 249.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.02
Median TPOT (ms): 64.02
P99 TPOT (ms): 64.02
---------------Inter-token Latency----------------
Mean ITL (ms): 64.02
Median ITL (ms): 62.22
P99 ITL (ms): 75.49
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 88.66
Total input tokens: 1371
Total generated tokens: 2453
Request throughput (req/s): 0.11
Output token throughput (tok/s): 27.67
Peak output token throughput (tok/s): 49.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 43.13
---------------Time to First Token----------------
Mean TTFT (ms): 2765.16
Median TTFT (ms): 3035.13
P99 TTFT (ms): 3036.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 189.91
Median TPOT (ms): 171.51
P99 TPOT (ms): 413.85
---------------Inter-token Latency----------------
Mean ITL (ms): 132.04
Median ITL (ms): 121.88
P99 ITL (ms): 217.29
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 252.77
Total input tokens: 22992
Total generated tokens: 20942
Request throughput (req/s): 0.40
Output token throughput (tok/s): 82.85
Peak output token throughput (tok/s): 222.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 173.81
---------------Time to First Token----------------
Mean TTFT (ms): 9816.27
Median TTFT (ms): 10136.62
P99 TTFT (ms): 18330.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 578.08
Median TPOT (ms): 480.05
P99 TPOT (ms): 1679.23
---------------Inter-token Latency----------------
Mean ITL (ms): 421.64
Median ITL (ms): 417.24
P99 ITL (ms): 1672.10
==================================================
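The three result blocks above report 1, 10, and 100 successful requests, while the command shown uses --num-prompts 1. Here is a rough sketch of how such a sweep could be scripted, assuming only --num-prompts changed between runs. RUNNER and bench_sweep are illustrative names of my own, not part of vLLM. By default it only prints the commands.

```shell
#!/bin/sh
# Sweep "vllm bench serve" over the prompt counts seen in the results above.
# Assumption: every other flag stays the same as in the original command.
# RUNNER defaults to "echo", so the commands are printed, not executed.
# Run with RUNNER= (empty) to actually execute them.
RUNNER="${RUNNER-echo}"

bench_sweep() {
  for n in 1 10 100; do
    # $RUNNER is left unquoted on purpose: when empty it disappears
    # and the vllm command itself is run.
    $RUNNER vllm bench serve --backend vllm \
      --model QuantTrio/GLM-4.7-AWQ \
      --endpoint /v1/completions \
      --dataset-name sharegpt \
      --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
      --num-prompts "$n" --port 8888 --host spark
  done
}

bench_sweep
```

Check the printed commands first, then rerun with RUNNER= once the server is up on port 8888.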
With MTP
vllm serve QuantTrio/GLM-4.7-AWQ \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.88 \
--max-model-len 32000 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8888 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.62
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.18
Output token throughput (tok/s): 21.17
Peak output token throughput (tok/s): 12.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 23.31
---------------Time to First Token----------------
Mean TTFT (ms): 249.01
Median TTFT (ms): 249.01
P99 TTFT (ms): 249.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.52
Median TPOT (ms): 45.52
P99 TPOT (ms): 45.52
---------------Inter-token Latency----------------
Mean ITL (ms): 83.92
Median ITL (ms): 83.74
P99 ITL (ms): 97.95
==================================================
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 67.54
Total input tokens: 1371
Total generated tokens: 2526
Request throughput (req/s): 0.15
Output token throughput (tok/s): 37.40
Peak output token throughput (tok/s): 35.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 57.70
---------------Time to First Token----------------
Mean TTFT (ms): 2335.06
Median TTFT (ms): 2563.09
P99 TTFT (ms): 2564.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 116.90
Median TPOT (ms): 127.24
P99 TPOT (ms): 148.89
---------------Inter-token Latency----------------
Mean ITL (ms): 189.11
Median ITL (ms): 180.69
P99 ITL (ms): 268.93
==================================================
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 334.69
Total input tokens: 22992
Total generated tokens: 3757
Request throughput (req/s): 0.30
Output token throughput (tok/s): 11.23
Peak output token throughput (tok/s): 157.00
Peak concurrent requests: 100.00
Total token throughput (tok/s): 79.92
---------------Time to First Token----------------
Mean TTFT (ms): 11175.71
Median TTFT (ms): 11340.99
P99 TTFT (ms): 19969.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 524.87
Median TPOT (ms): 476.19
P99 TPOT (ms): 1255.18
---------------Inter-token Latency----------------
Mean ITL (ms): 848.31
Median ITL (ms): 598.62
P99 ITL (ms): 1746.27
==================================================
The server crashed after serving the 100 requests in the MTP run.
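To put the two sets of numbers side by side, here is a small sketch that divides the MTP output token throughput by the non-MTP one for each prompt count. The values are copied from the "Output token throughput (tok/s)" lines above; mtp_ratio is just an illustrative helper name.

```shell
#!/bin/sh
# Ratio of output token throughput with MTP vs. without MTP.
# Columns: prompts, tok/s without MTP, tok/s with MTP (from the results above).
mtp_ratio() {
  awk '{ printf "%s prompts: %.2fx\n", $1, $3 / $2 }' <<'EOF'
1 15.25 21.17
10 27.67 37.40
100 82.85 11.23
EOF
}

mtp_ratio
```

This prints 1.39x for 1 prompt, 1.35x for 10 prompts, and 0.14x for 100 prompts. The 100-prompt MTP figure comes from a run that generated far fewer tokens (3757 vs 20942) and ended with the crash, so it probably says more about that failure than about MTP itself.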