GLM 4.6V works on Spark!

To run it on a single Spark, you will need a 4-bit quant - those will be coming soon.
But you can run the FP8 version on dual Sparks and get ~22 t/s.

You can use my Docker build at GitHub: eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks).

However, you will need to perform a few extra steps inside the container to run this model. It is the first model to require the newest version of the Transformers library, v5. That version is still in the release-candidate phase and currently has some issues, so I’m not going to make it the default in my build yet.

To run the model, you’ll have to enter the running container on both nodes and run this command before launching the model:

pip install "transformers>=5.0.0" --pre -U

Then you can launch the model on the head container using this command:

vllm serve zai-org/GLM-4.6V-FP8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888

Adjust the parameters as needed. Note that fastsafetensors works for loading, but my vLLM froze during inference while I was benchmarking 100 requests, so I recommend not using it for now.

Some benchmarks:

vllm bench serve \
  --backend vllm \
  --model zai-org/GLM-4.6V-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8888 \
  --host spark \
  --num-prompts 1

Single request:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  5.18
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.19
Output token throughput (tok/s):         22.98
Peak output token throughput (tok/s):    24.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          25.30
---------------Time to First Token----------------
Mean TTFT (ms):                          163.40
Median TTFT (ms):                        163.40
P99 TTFT (ms):                           163.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.50
Median TPOT (ms):                        42.50
P99 TPOT (ms):                           42.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.50
Median ITL (ms):                         42.06
P99 ITL (ms):                            52.69
==================================================

10 requests:

                            Output tokens per second
  80 +----------------------------------------------------------------------+
     |                                                                      |
  70 |    * *   *                                                           |
     |    * **  *                                                           |
     |    ***** **                                                          |
  60 |   * *   * *                                                          |
     | *** *   * *                                                          |
  50 |*           **            *                                           |
     |*             **         * *                                          |
  40 |*               ** ******  **                                         |
     |*                 **         *** ***** ***                 *          |
     |*                 *             *     *   ***************** ***       |
  30 |                                                               *      |
     |                                                                *     |
  20 |                                                                *     |
     |                                                                *     |
     |                                                                *     |
  10 |                                                                 *    |
     |                                                                 *    |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70

                         Concurrent requests per second
  10 +----------------------------------------------------------------------+
     |  *                                                                   |
     |  **                                                                  |
     |    *                                                                 |
   8 |    *******                                                           |
     |           *                                                          |
     |           ***                                                        |
     |              *                                                       |
   6 |              **                                                      |
     |                **********                                            |
     |                          *                                           |
   4 |                          ***                                         |
     |                             *                                        |
     |                             **************                           |
     |                                           *                          |
   2 |                                           *********************      |
     |                                                                *     |
     |                                                                *     |
     |                                                                 *    |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  64.80
Total input tokens:                      1371
Total generated tokens:                  2654
Request throughput (req/s):              0.15
Output token throughput (tok/s):         40.96
Peak output token throughput (tok/s):    72.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          62.11
---------------Time to First Token----------------
Mean TTFT (ms):                          890.60
Median TTFT (ms):                        970.14
P99 TTFT (ms):                           971.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          128.82
Median TPOT (ms):                        133.36
P99 TPOT (ms):                           172.51
---------------Inter-token Latency----------------
Mean ITL (ms):                           101.23
Median ITL (ms):                         94.25
P99 ITL (ms):                            183.08
==================================================

100 requests:

                             Output tokens per second
  400 +---------------------------------------------------------------------+
      |              *                                                      |
  350 |              *                                                      |
      |              *  *                                                   |
      |             * ***                                                   |
  300 |             * ** * **                                               |
      |             * *  * **  * * *                                        |
  250 |             *    **  *** * *  * * *  ** *                           |
      |             *     *  ** * * *** * * * * * *  * * ***                |
  200 |             *        *  * * ** * * **  * *** *** * * * **  *        |
      |             *               *  * * *   * *  * * *   **** * ** *     |
      |             *                                   *   * *  **  ***    |
  150 |             *                                             *  * ***  |
      |            *                                                     *  |
  100 |            *                                                     *  |
      |    *   *****                                                     *  |
      |    ****   **                                                      * |
   50 |   * *      *                                                      * |
      |****        *                                                      * |
    0 +---------------------------------------------------------------------+
      0           10         20          30          40         50          60

                          Concurrent requests per second
  100 +---------------------------------------------------------------------+
      |              *                                                      |
      |               ***                                                   |
      |                  *                                                  |
   80 |                   **                                                |
      |                     *                                               |
      |                      **********                                     |
      |                                ***********                          |
   60 |                                           ***********               |
      |                                                      ********       |
      |                                                              *****  |
   40 |                                                                  *  |
      |                                                                  *  |
      |                                                                  *  |
      |                                                                  *  |
   20 |                                                                   * |
      |                                                                   * |
      |                                                                   * |
      |                                                                   * |
    0 +---------------------------------------------------------------------+
      0           10         20          30          40         50          60
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  357.56
Total input tokens:                      22992
Total generated tokens:                  10842
Request throughput (req/s):              0.28
Output token throughput (tok/s):         30.32
Peak output token throughput (tok/s):    370.00
Peak concurrent requests:                100.00
Total Token throughput (tok/s):          94.62
---------------Time to First Token----------------
Mean TTFT (ms):                          6152.01
Median TTFT (ms):                        5896.97
P99 TTFT (ms):                           12042.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          390.24
Median TPOT (ms):                        322.20
P99 TPOT (ms):                           946.02
---------------Inter-token Latency----------------
Mean ITL (ms):                           322.65
Median ITL (ms):                         282.40
P99 ITL (ms):                            970.22
==================================================
5 Likes

It passed my gene expression heatmap test with flying colors. So far, only Qwen3-VL-32B and Qwen3-VL-235B were able to correctly pass it :)

2 Likes

Thanks for sharing, eugr. I can only run with a max-model-len of 8192 on one Spark. If I disable both image and video, max-model-len can go up to 24.5K. How many Sparks did you use for the GLM 4.6V test here?

I’m running on two Sparks. On a single Spark you will need something like cyankiwi/GLM-4.6V-AWQ-4bit - it should fit just fine even with 128K context, as the weights are ~64GB.

To fit double the context with the FP8 version, you can use --kv-cache-dtype fp8 to quantize the KV cache if needed.
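For reference, here is where the flag slots into the launch command from earlier - a sketch only, with the other flags unchanged from the dual-Spark command above:

```shell
vllm serve zai-org/GLM-4.6V-FP8 \
  --kv-cache-dtype fp8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  -tp 2 \
  --gpu-memory-utilization 0.7 \
  --distributed-executor-backend ray \
  --host 0.0.0.0 \
  --port 8888
```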

What am I getting wrong about quantization of GLM 4.6V?

I am running the 8-bit quant of unsloth/GLM-4.6V-GGUF (114GB) on one DGX Spark.

./llama.cpp/build/bin/llama-server \
  -m GLM-4.6-V-FP8_GGUF/GLM-4.6V-Q8_0-00001-of-00003.gguf \
  --mmproj GLM-4.6-V-FP8_GGUF/mmproj-F16.gguf \
  --host 0.0.0.0 \
  --port 8001 \
  -c 15000 \
  -a glm-4.6-V \
  -ngl 999 \
  --cache-ram 0 \
  -np 1 \
  --temp 0.1 \
  --top-k 20 \
  --top-p 0.9 \
  --min-p 0.05

Isn’t it the same precision as yours, eugr? It is slow and takes 98% of RAM (126GB), but it works for my case on a single DGX Spark.

Or is it a lower precision than yours?

No, it’s not the same - this one is INT8, while the one I ran was FP8. Slightly lower precision, but not a very significant difference, I believe. FP8 can be more efficient on Spark due to native FP8 support, though.

The biggest difference is that llama.cpp is much better optimized in terms of VRAM consumption than vLLM, so you can fit more.

I’m running GLM-4.6V (zai-org/GLM-4.6V-FP8) on stacked Sparks using @eugr’s Docker configuration for running vLLM on dual Sparks. Yesterday and today I’ve been testing its OCR abilities, and I must say they are amazing. I had previously looked at a number of different models, and many are reasonably good, but this is a big step up in quality.

The website for the Allen Institute’s olmOCR provides some sample files, including a handwritten order by Abraham Lincoln, that are processed very accurately. It lets you upload your own files for testing, but I was disappointed by the accuracy on my own handwritten documents.

I scanned notes I had taken during my college years (1972-75) and decided to send a few to GLM-4.6V on my Sparks. My prompt was simple: “extract all the text from the image into markdown format /nothink”.

A single image is processed in about 20 - 25 seconds. I then sent 8 images simultaneously and they took a total of about 1 1/2 minutes to process all eight, an average of about 12 seconds each.

The accuracy of the text extraction approached 100% (some files were completely accurate and others were very close). Not only did my handwriting come through well, but so did typed pages from exams that were copies of copies and thus somewhat degraded.

I should mention that the /nothink modifier was necessary: without it, the processing took several times as long without improving the already stupendous accuracy.
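For anyone wanting to reproduce this, a request like the one described can be sketched against vLLM’s OpenAI-compatible chat endpoint as below. page1.jpg and the spark:8888 host/port are placeholders for your own scan and server:

```shell
# Send one scanned page to the model with the /nothink prompt.
# page1.jpg and spark:8888 are placeholders - substitute your own.
IMG=$(base64 -w0 page1.jpg 2>/dev/null)   # inline the image as a data URL
cat > ocr.json <<EOF
{
  "model": "zai-org/GLM-4.6V-FP8",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url",
       "image_url": {"url": "data:image/jpeg;base64,${IMG}"}},
      {"type": "text",
       "text": "extract all the text from the image into markdown format /nothink"}
    ]
  }]
}
EOF
curl -s http://spark:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @ocr.json \
  || echo "request failed - is the server up?"
```

Sending several pages is just a matter of firing these requests concurrently, which matches the ~8-at-once timing above.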

1 Like

Yeah, it’s my favorite vision model so far. Previously it was Qwen3-VL-235B.

I’m trying to add the no-think flag to the vllm command when running your container, but so far I haven’t figured out the magic words. None of the following work:

  • --default-chat-template-kwargs '{"enable_thinking": false}'
  • --no-thinking
  • extra_body={chat_template_kwargs: {enable_thinking: False}}
  • --extra-body {chat_template_kwargs: {enable_thinking: False}}

Any idea what the magic words are?

Also (and I may have mentioned this before), in OpenWebUI I get a Thinking… indicator with a down chevron, but in other contexts I get the thinking followed by the answer with no indication that there’s a thinking part and a result part.

Based on the vllm docs and the model chat template, this one should work: --default-chat-template-kwargs '{"enable_thinking": false}'
You can also try --chat-template-kwargs instead of --default-chat-template-kwargs.
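There is also a per-request route: recent vLLM builds accept chat_template_kwargs in the request body of the OpenAI-compatible chat endpoint, though support depends on your vLLM version, so treat this as a sketch. It reuses the host/port from the serve command earlier in the thread:

```shell
# Disable thinking for this one request only via chat_template_kwargs.
# spark:8888 matches the serve command above - adjust to your setup.
cat > payload.json <<'EOF'
{
  "model": "zai-org/GLM-4.6V-FP8",
  "messages": [{"role": "user", "content": "Say hello."}],
  "chat_template_kwargs": {"enable_thinking": false}
}
EOF
curl -s http://spark:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @payload.json \
  || echo "request failed - is the server up?"
```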

Thanks for your reply. I read the same model chat template, and I did try both of those (I listed my tries above but omitted --chat-template-kwargs from the list). My autocorrect changed the double dashes to a single dash in that message, but on the command line they were two dashes, as you will see below.

Unfortunately, as I said, --default-chat-template-kwargs didn’t work. I also tried --chat-template-kwargs, but that didn’t work either:

./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -t vllm-node-whl-tf5 exec vllm serve zai-org/GLM-4.6V-FP8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --allowed-local-media-path / --mm-encoder-tp-mode data -tp 2 --gpu-memory-utilization 0.7 --distributed-executor-backend ray --host 0.0.0.0  --port 8000 --load-format fastsafetensors --chat-template-kwargs '{"enable_thinking": false}'

vllm: error: unrecognized arguments: --default-chat-template-kwargs {enable_thinking: false}
./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -t vllm-node-whl-tf5 exec vllm serve zai-org/GLM-4.6V-FP8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --allowed-local-media-path / --mm-encoder-tp-mode data -tp 2 --gpu-memory-utilization 0.7 --distributed-executor-backend ray --host 0.0.0.0 --port 8000 --load-format fastsafetensors --chat-template-kwargs '{"enable_thinking": false}'

vllm: error: unrecognized arguments: --chat-template-kwargs {enable_thinking: false}

so I’m still at a loss.

Yeah, the --chat-template-kwargs does not exist, you can only use --default-chat-template-kwargs. It should work, not sure why it doesn’t. Another option would be to create a copy of the chat template and add this at the top: {% set enable_thinking = false %}.

And then use this argument: --chat-template ./glm4.6v-nothink.jinja
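The template edit can be sketched in shell. chat_template.jinja here is a placeholder for a local export of the model’s original template (e.g. copied out of your HF cache):

```shell
# Build a no-think variant of the chat template: the override line first,
# then the body of the original template. chat_template.jinja is a
# placeholder - export the model's own template to that path first.
printf '%s\n' '{% set enable_thinking = false %}' > glm4.6v-nothink.jinja
[ -f chat_template.jinja ] && cat chat_template.jinja >> glm4.6v-nothink.jinja
head -1 glm4.6v-nothink.jinja   # prints the override line
```

This leaves the cached original untouched, unlike editing it in place.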

1 Like

Once again, Gene, you’ve done it. I actually edited the original jinja file in the cache, but I assume your command line switch will also work. And I suppose that’s safer.

Thanks for your help.

1 Like