To run this on a single Spark, you will need a 4-bit quant - those are coming soon.
But you can run the FP8 version on dual Sparks and get about 22 t/s.
You can use my Docker build at GitHub: eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks).
However, you will need to perform some extra steps inside the container to run this model. It is the first model to require the newest version of the Transformers library - v5. v5 is still in the release-candidate phase and has some open issues, so I'm not going to make it the default in my build yet.
To run the model, you'll have to enter the running container on both nodes and run this command before launching the model:
pip install "transformers>=5.0.0" --pre -U
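Since v5 is only available as a release candidate, it's easy to end up with a stale v4 on one of the nodes. If you want to sanity-check what actually got installed (the rc suffix makes version strings easy to misread), a tiny pure-Python check like this works; the helper below is my own sketch, not part of pip or Transformers:

```python
import re

def is_transformers_v5(version: str) -> bool:
    """Return True if a version string like '5.0.0rc2' is 5.x or newer."""
    match = re.match(r"(\d+)", version)
    return match is not None and int(match.group(1)) >= 5

# Example strings you might see from `pip show transformers`:
print(is_transformers_v5("5.0.0rc2"))  # release candidate of v5 -> True
print(is_transformers_v5("4.57.1"))    # stable v4 -> False
```

Run it against the `Version:` line of `pip show transformers` inside each container.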
Then you can launch the model on the head container using this command:
vllm serve zai-org/GLM-4.6V-FP8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--allowed-local-media-path / \
--mm-encoder-tp-mode data \
-tp 2 \
--gpu-memory-utilization 0.7 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8888
Adjust the parameters as needed. Note that fastsafetensors works for loading, but my vLLM froze during inference when I was benchmarking 100 requests with it, so I recommend not using it for now.
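Once the server is up, a quick smoke test against the OpenAI-compatible endpoint confirms everything is wired up. This is just a sketch using only the standard library, assuming the host/port from the serve command above (host `spark`, port 8888):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # Minimal OpenAI-style /v1/completions payload for the vLLM server.
    return {
        "model": "zai-org/GLM-4.6V-FP8",
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0,
    }

def smoke_test(host: str = "spark", port: int = 8888) -> str:
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(build_payload("Say hello.")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage (with the server running):
#   print(smoke_test())
```

If you'd rather stay in the shell, the same request works with curl against `http://spark:8888/v1/completions`.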
Some benchmarks:
vllm bench serve \
--backend vllm \
--model zai-org/GLM-4.6V-FP8 \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--port 8888 \
--host spark \
--num-prompts 1
Single request:
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 5.18
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.19
Output token throughput (tok/s): 22.98
Peak output token throughput (tok/s): 24.00
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 25.30
---------------Time to First Token----------------
Mean TTFT (ms): 163.40
Median TTFT (ms): 163.40
P99 TTFT (ms): 163.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 42.50
Median TPOT (ms): 42.50
P99 TPOT (ms): 42.50
---------------Inter-token Latency----------------
Mean ITL (ms): 42.50
Median ITL (ms): 42.06
P99 ITL (ms): 52.69
==================================================
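As a sanity check on these numbers: for a single stream, output throughput is just the inverse of TPOT (plus the one-time TTFT cost), and the reported totals agree with that:

```python
# Single-request run, numbers copied from the result block above:
output_tokens = 119   # Total generated tokens
duration_s = 5.18     # Benchmark duration (s)
tpot_ms = 42.50       # Mean time per output token

throughput = output_tokens / duration_s  # ~22.97 tok/s, matching the reported 22.98
steady_state = 1000 / tpot_ms            # ~23.5 tok/s once the first token is excluded
print(round(throughput, 2), round(steady_state, 2))
```

The small gap between the two is the 163 ms TTFT amortized over the run.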
10 requests:
Output tokens per second
80 +----------------------------------------------------------------------+
| |
70 | * * * |
| * ** * |
| ***** ** |
60 | * * * * |
| *** * * * |
50 |* ** * |
|* ** * * |
40 |* ** ****** ** |
|* ** *** ***** *** * |
|* * * * ***************** *** |
30 | * |
| * |
20 | * |
| * |
| * |
10 | * |
| * |
0 +----------------------------------------------------------------------+
0 10 20 30 40 50 60 70
Concurrent requests per second
10 +----------------------------------------------------------------------+
| * |
| ** |
| * |
8 | ******* |
| * |
| *** |
| * |
6 | ** |
| ********** |
| * |
4 | *** |
| * |
| ************** |
| * |
2 | ********************* |
| * |
| * |
| * |
0 +----------------------------------------------------------------------+
0 10 20 30 40 50 60 70
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 64.80
Total input tokens: 1371
Total generated tokens: 2654
Request throughput (req/s): 0.15
Output token throughput (tok/s): 40.96
Peak output token throughput (tok/s): 72.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 62.11
---------------Time to First Token----------------
Mean TTFT (ms): 890.60
Median TTFT (ms): 970.14
P99 TTFT (ms): 971.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 128.82
Median TPOT (ms): 133.36
P99 TPOT (ms): 172.51
---------------Inter-token Latency----------------
Mean ITL (ms): 101.23
Median ITL (ms): 94.25
P99 ITL (ms): 183.08
==================================================
100 requests:
Output tokens per second
400 +---------------------------------------------------------------------+
| * |
350 | * |
| * * |
| * *** |
300 | * ** * ** |
| * * * ** * * * |
250 | * ** *** * * * * * ** * |
| * * ** * * *** * * * * * * * * *** |
200 | * * * * ** * * ** * *** *** * * * ** * |
| * * * * * * * * * * **** * ** * |
| * * * * ** *** |
150 | * * * *** |
| * * |
100 | * * |
| * ***** * |
| **** ** * |
50 | * * * * |
|**** * * |
0 +---------------------------------------------------------------------+
0 10 20 30 40 50 60
Concurrent requests per second
100 +---------------------------------------------------------------------+
| * |
| *** |
| * |
80 | ** |
| * |
| ********** |
| *********** |
60 | *********** |
| ******** |
| ***** |
40 | * |
| * |
| * |
| * |
20 | * |
| * |
| * |
| * |
0 +---------------------------------------------------------------------+
0 10 20 30 40 50 60
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Benchmark duration (s): 357.56
Total input tokens: 22992
Total generated tokens: 10842
Request throughput (req/s): 0.28
Output token throughput (tok/s): 30.32
Peak output token throughput (tok/s): 370.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 94.62
---------------Time to First Token----------------
Mean TTFT (ms): 6152.01
Median TTFT (ms): 5896.97
P99 TTFT (ms): 12042.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 390.24
Median TPOT (ms): 322.20
P99 TPOT (ms): 946.02
---------------Inter-token Latency----------------
Mean ITL (ms): 322.65
Median ITL (ms): 282.40
P99 ITL (ms): 970.22
==================================================
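Pulling the three runs together: total token throughput keeps climbing with concurrency while per-request speed drops, which is the expected batching trade-off. The numbers below are copied straight from the result blocks above; the per-request figure is just the inverse of mean TPOT:

```python
# (num prompts, total token throughput tok/s, mean TPOT ms) from the runs above
runs = [
    (1, 25.30, 42.50),
    (10, 62.11, 128.82),
    (100, 94.62, 390.24),
]
for n, total_tput, tpot_ms in runs:
    per_request = 1000 / tpot_ms  # steady-state tok/s seen by a single request
    print(f"{n:>3} prompts: {total_tput:6.2f} tok/s total, {per_request:5.1f} tok/s per request")
```

So aggregate throughput roughly quadruples going from 1 to 100 prompts, but each individual request slows from ~23 tok/s to under 3 tok/s.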