GLM 4.6V works on Spark!

To run it on a single Spark, you will need a 4-bit quant - those will be coming soon.
But you can run the FP8 version on dual Sparks and get ~22 t/s.

You can use my Docker build at GitHub: eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks).

However, you will need to perform a few extra steps inside the container to run this model. It is the first model to require the newest version of the Transformers library, v5. That version is still in the release-candidate phase and currently has some issues, so I’m not going to make it the default in my build yet.

To run the model, you’ll have to enter the running container on both nodes and run this command before launching the model:

pip install "transformers>=5.0.0" --pre -U

Then you can launch the model on the head container using this command:

vllm serve zai-org/GLM-4.6V-FP8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888

Adjust the parameters as needed. Note that fastsafetensors works for loading, but my vLLM froze during inference while I was benchmarking 100 requests, so I recommend not using it for now.

Some benchmarks:

vllm bench serve \
  --backend vllm \
  --model zai-org/GLM-4.6V-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8888 \
  --host spark \
  --num-prompts 1

Single request:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  5.18
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.19
Output token throughput (tok/s):         22.98
Peak output token throughput (tok/s):    24.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          25.30
---------------Time to First Token----------------
Mean TTFT (ms):                          163.40
Median TTFT (ms):                        163.40
P99 TTFT (ms):                           163.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.50
Median TPOT (ms):                        42.50
P99 TPOT (ms):                           42.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.50
Median ITL (ms):                         42.06
P99 ITL (ms):                            52.69
==================================================

10 requests:

                            Output tokens per second
  80 +----------------------------------------------------------------------+
     |                                                                      |
  70 |    * *   *                                                           |
     |    * **  *                                                           |
     |    ***** **                                                          |
  60 |   * *   * *                                                          |
     | *** *   * *                                                          |
  50 |*           **            *                                           |
     |*             **         * *                                          |
  40 |*               ** ******  **                                         |
     |*                 **         *** ***** ***                 *          |
     |*                 *             *     *   ***************** ***       |
  30 |                                                               *      |
     |                                                                *     |
  20 |                                                                *     |
     |                                                                *     |
     |                                                                *     |
  10 |                                                                 *    |
     |                                                                 *    |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70

                         Concurrent requests per second
  10 +----------------------------------------------------------------------+
     |  *                                                                   |
     |  **                                                                  |
     |    *                                                                 |
   8 |    *******                                                           |
     |           *                                                          |
     |           ***                                                        |
     |              *                                                       |
   6 |              **                                                      |
     |                **********                                            |
     |                          *                                           |
   4 |                          ***                                         |
     |                             *                                        |
     |                             **************                           |
     |                                           *                          |
   2 |                                           *********************      |
     |                                                                *     |
     |                                                                *     |
     |                                                                 *    |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  64.80
Total input tokens:                      1371
Total generated tokens:                  2654
Request throughput (req/s):              0.15
Output token throughput (tok/s):         40.96
Peak output token throughput (tok/s):    72.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          62.11
---------------Time to First Token----------------
Mean TTFT (ms):                          890.60
Median TTFT (ms):                        970.14
P99 TTFT (ms):                           971.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          128.82
Median TPOT (ms):                        133.36
P99 TPOT (ms):                           172.51
---------------Inter-token Latency----------------
Mean ITL (ms):                           101.23
Median ITL (ms):                         94.25
P99 ITL (ms):                            183.08
==================================================

100 requests:

                             Output tokens per second
  400 +---------------------------------------------------------------------+
      |              *                                                      |
  350 |              *                                                      |
      |              *  *                                                   |
      |             * ***                                                   |
  300 |             * ** * **                                               |
      |             * *  * **  * * *                                        |
  250 |             *    **  *** * *  * * *  ** *                           |
      |             *     *  ** * * *** * * * * * *  * * ***                |
  200 |             *        *  * * ** * * **  * *** *** * * * **  *        |
      |             *               *  * * *   * *  * * *   **** * ** *     |
      |             *                                   *   * *  **  ***    |
  150 |             *                                             *  * ***  |
      |            *                                                     *  |
  100 |            *                                                     *  |
      |    *   *****                                                     *  |
      |    ****   **                                                      * |
   50 |   * *      *                                                      * |
      |****        *                                                      * |
    0 +---------------------------------------------------------------------+
      0           10         20          30          40         50          60

                          Concurrent requests per second
  100 +---------------------------------------------------------------------+
      |              *                                                      |
      |               ***                                                   |
      |                  *                                                  |
   80 |                   **                                                |
      |                     *                                               |
      |                      **********                                     |
      |                                ***********                          |
   60 |                                           ***********               |
      |                                                      ********       |
      |                                                              *****  |
   40 |                                                                  *  |
      |                                                                  *  |
      |                                                                  *  |
      |                                                                  *  |
   20 |                                                                   * |
      |                                                                   * |
      |                                                                   * |
      |                                                                   * |
    0 +---------------------------------------------------------------------+
      0           10         20          30          40         50          60
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  357.56
Total input tokens:                      22992
Total generated tokens:                  10842
Request throughput (req/s):              0.28
Output token throughput (tok/s):         30.32
Peak output token throughput (tok/s):    370.00
Peak concurrent requests:                100.00
Total Token throughput (tok/s):          94.62
---------------Time to First Token----------------
Mean TTFT (ms):                          6152.01
Median TTFT (ms):                        5896.97
P99 TTFT (ms):                           12042.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          390.24
Median TPOT (ms):                        322.20
P99 TPOT (ms):                           946.02
---------------Inter-token Latency----------------
Mean ITL (ms):                           322.65
Median ITL (ms):                         282.40
P99 ITL (ms):                            970.22
==================================================
5 Likes

It passed my gene expression heatmap test with flying colors. So far, only Qwen3-VL-32B and Qwen3-VL-235B were able to correctly pass it :)

2 Likes

Thanks for sharing, eugr. I can only run with a max-model-len of 8192 on one Spark. If I disable both image and video, max-model-len can go up to 24.5K. How many Sparks did you use for the GLM 4.6V test here?

I’m running on two Sparks. On a single Spark you will need something like cyankiwi/GLM-4.6V-AWQ-4bit - it should fit just fine even with 128K context, as the weights are ~64GB.

To fit double the context with the FP8 version, you can use --kv-cache-dtype fp8 to quantize the KV cache if needed.
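For reference, here is where the flag slots into the launch command from earlier - a sketch only, with the other flags unchanged from the dual-Spark command above:

```shell
vllm serve zai-org/GLM-4.6V-FP8 \
  --kv-cache-dtype fp8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888
```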

What am I getting wrong about quantization of GLM 4.6V?

I am running the 8-bit quant of unsloth/GLM-4.6V-GGUF (114GB) on one DGX Spark.

./llama.cpp/build/bin/llama-server \
  -m GLM-4.6-V-FP8_GGUF/GLM-4.6V-Q8_0-00001-of-00003.gguf \
  --mmproj GLM-4.6-V-FP8_GGUF/mmproj-F16.gguf \
  --host 0.0.0.0 \
  --port 8001 \
  -c 15000 \
  -a glm-4.6-V \
  -ngl 999 \
  --cache-ram 0 \
  -np 1 \
  --temp 0.1 \
  --top-k 20 \
  --top-p 0.9 \
  --min-p 0.05

Isn’t it the same precision as yours, eugr? It is slow and takes 98% of RAM (126GB), but it works for my case on a single DGX Spark.

Or is it a lower precision than yours?

No, it’s not the same - this one is INT8, while the one I ran was FP8. Slightly lower precision, but not a very significant difference, I believe. FP8 can be more efficient on Spark due to native FP8 support, though.

The biggest difference is that llama.cpp is much better optimized in terms of VRAM consumption than vLLM, so you can fit more.

I’m running GLM-4.6V (zai-org/GLM-4.6V-FP8) on stacked Sparks using @eugr’s Docker configuration for running vLLM on dual Sparks. Yesterday and today I’ve been testing its OCR abilities, and I must say they are amazing. I had previously looked at a number of different models, and many are reasonably good, but this is a big step up in quality.

The website for the Allen Institute’s olmOCR provides some sample files, including a handwritten order by Abraham Lincoln, that are processed very accurately. It lets you upload your own files for testing, but I was disappointed by the accuracy on my own handwritten documents.

I scanned notes I had taken during my college years (1972-75) and decided to send a few to GLM-4.6V on my Sparks. My prompt was simple: “extract all the text from the image into markdown format /nothink”.

A single image is processed in about 20 - 25 seconds. I then sent 8 images simultaneously and they took a total of about 1 1/2 minutes to process all eight, an average of about 12 seconds each.

The accuracy of the text extraction approached 100% (some files were completely accurate and others were very close). Not only did my handwriting come through well, but so did typed pages from exams that were copies of copies and thus somewhat degraded.

I should mention that the /nothink modifier was necessary: without it, the processing took several times as long without improving the already stupendous accuracy.
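For anyone wanting to reproduce this, a request like the one described can be sketched against vLLM’s OpenAI-compatible chat endpoint as below. page1.jpg and the spark:8888 host/port are placeholders for your own scan and server:

```shell
# Send one scanned page to the model with the /nothink prompt.
# page1.jpg and spark:8888 are placeholders - substitute your own.
IMG=$(base64 -w0 page1.jpg 2>/dev/null)   # inline the image as a data URL
cat > ocr.json <<EOF
{
  "model": "zai-org/GLM-4.6V-FP8",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url",
       "image_url": {"url": "data:image/jpeg;base64,${IMG}"}},
      {"type": "text",
       "text": "extract all the text from the image into markdown format /nothink"}
    ]
  }]
}
EOF
curl -s http://spark:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @ocr.json \
  || echo "request failed - is the server up?"
```

Sending several pages is just a matter of firing these requests concurrently, which matches the ~8-at-once timing above.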

1 Like

Yeah, it’s my favorite vision model so far. Previously it was Qwen3-VL-235B.

I’m trying to add the no-think flag to the vllm command when running your container, but so far I haven’t figured out the magic words. None of the following work:

  • --default-chat-template-kwargs '{"enable_thinking": false}'
  • --no-thinking
  • extra_body={chat_template_kwargs: {enable_thinking: False}}
  • --extra-body {chat_template_kwargs: {enable_thinking: False}}

Any idea what the magic words are?

Also (and I may have mentioned this before), in OpenWebUI I get a Thinking… indicator with a down chevron, but in other contexts I get the thinking followed by the answer with no indication that there’s a thinking part and a result part.

Based on the vllm docs and the model chat template, this one should work: --default-chat-template-kwargs '{"enable_thinking": false}'
You can also try --chat-template-kwargs instead of --default-chat-template-kwargs.
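There is also a per-request route: recent vLLM builds accept chat_template_kwargs in the request body of the OpenAI-compatible chat endpoint, though support depends on your vLLM version, so treat this as a sketch. It reuses the host/port from the serve command earlier in the thread:

```shell
# Disable thinking for this one request only via chat_template_kwargs.
# spark:8888 matches the serve command above - adjust to your setup.
cat > payload.json <<'EOF'
{
  "model": "zai-org/GLM-4.6V-FP8",
  "messages": [{"role": "user", "content": "Say hello."}],
  "chat_template_kwargs": {"enable_thinking": false}
}
EOF
curl -s http://spark:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json \
  || echo "request failed - is the server up?"
```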

Thanks for your reply. I read the same model chat template, and I did try both of those (I listed my tries above but omitted --chat-template-kwargs from the list). My autocorrect changed the double dashes to a single dash in that message, but on the command line they were two dashes, as you will see below.

Unfortunately, as I said, --default-chat-template-kwargs didn’t work. I also tried --chat-template-kwargs, but that didn’t work either:

./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -t vllm-node-whl-tf5 exec vllm serve zai-org/GLM-4.6V-FP8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --allowed-local-media-path / --mm-encoder-tp-mode data -tp 2 --gpu-memory-utilization 0.7 --distributed-executor-backend ray --host 0.0.0.0  --port 8000 --load-format fastsafetensors --chat-template-kwargs '{"enable_thinking": false}'

vllm: error: unrecognized arguments: --default-chat-template-kwargs {enable_thinking: false}
./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -t vllm-node-whl-tf5 exec vllm serve zai-org/GLM-4.6V-FP8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --allowed-local-media-path / --mm-encoder-tp-mode data -tp 2 --gpu-memory-utilization 0.7 --distributed-executor-backend ray --host 0.0.0.0 --port 8000 --load-format fastsafetensors --chat-template-kwargs '{"enable_thinking": false}'

vllm: error: unrecognized arguments: --chat-template-kwargs {enable_thinking: false}

so I’m still at a loss.

Yeah, the --chat-template-kwargs does not exist, you can only use --default-chat-template-kwargs. It should work, not sure why it doesn’t. Another option would be to create a copy of the chat template and add this at the top: {% set enable_thinking = false %}.

And then use this argument: --chat-template ./glm4.6v-nothink.jinja
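The template edit can be sketched in shell. chat_template.jinja here is a placeholder for a local export of the model’s original template (e.g. copied out of your HF cache):

```shell
# Build a no-think variant of the chat template: the override line first,
# then the body of the original template. chat_template.jinja is a
# placeholder - export the model's own template to that path first.
printf '%s\n' '{% set enable_thinking = false %}' > glm4.6v-nothink.jinja
[ -f chat_template.jinja ] && cat chat_template.jinja >> glm4.6v-nothink.jinja
head -1 glm4.6v-nothink.jinja   # prints the override line
```

This leaves the cached original untouched, unlike editing it in place.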

1 Like

Once again, Gene, you’ve done it. I actually edited the original jinja file in the cache, but I assume your command line switch will also work. And I suppose that’s safer.

Thanks for your help.

1 Like