MTP+llama.cpp: a look at Qwen3.6-27B

There’s growing excitement around MTP (multi-token prediction) in llama.cpp, sparked by this PR: llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp · GitHub

I decided to give it a try on my Spark, using the Q4_K_M quant of Unsloth’s MTP version: unsloth/Qwen3.6-27B-MTP-GGUF · Hugging Face

I tried 4, 5, and 6 draft tokens. The results for 5 are below; it seemed to give the best performance across concurrencies 1-4.

                                      llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃  TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │        719 │       28.3 │      2,723 │      7,031 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │        614 │       25.9 │      5,777 │     14,849 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │        643 │       29.9 │     11,376 │     26,996 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │        660 │       27.6 │      8,161 │     12,588 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        604 │       31.6 │     17,405 │     25,147 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        569 │       31.1 │     36,838 │     52,118 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │        635 │       29.0 │     13,838 │     18,039 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        578 │       26.5 │     30,210 │     39,216 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        432 │       26.9 │     80,560 │     98,285 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Build config:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CPU_KLEIDIAI=ON

Command:

export CUDA_SCALE_LAUNCH_QUEUES=4x
export GGML_CUDA_GRAPH_OPT=1
export GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F=1

./build/bin/llama-server \
--hf-repo unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--host 0.0.0.0 \
--port 8000 \
--alias qwen3.6-27b \
--parallel 8 \
--threads 10 \
--threads-batch 10 \
--threads-http 8 \
--prio 3 \
--poll 100 \
--direct-io \
--metrics \
--ctx-size 262144 \
--batch-size 32768 \
--ubatch-size 4096 \
--cache-ram 65536 \
--kv-unified \
-ctk f16 -ctv f16 \
--cache-reuse 1024 \
--ctx-checkpoints 128 \
--no-mmap \
--mlock \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.05 \
--repeat-penalty 1.05 \
--presence-penalty 0.00 \
--flash-attn on \
--chat-template-kwargs '{"preserve_thinking": true}' \
--chat-template-file ./froggeric.jinja \
--spec-type draft-mtp \
--spec-draft-n-max 5 \
--spec-draft-p-min 0.75

How much faster is that than with MTP disabled?

That draft value seems pretty high. I’ve only seen 2 or 3 recommended.

Do you find the dense model better for coding than the MoE? Are you using it for coding or something else? I’m super pleased with qwen36moe@8bit on my spark

@coder543 Unfortunately, I’m now getting persistent crashes trying to get the baseline numbers. I’m having the same issue that was reported on the PR, but it was working for a while.

@verdverm Yes, 2-3 has generally been recommended for Qwen3.6, but I think the key is the new ability to specify a minimum probability for the draft (--spec-draft-p-min 0.75), which boosts the acceptance rate at higher draft-token counts, up to 7. I use Qwen for coding and I do prefer the MoE, but the MTP benefit for the dense model does seem appealing. I think 30 t/s is usable, but llama.cpp doesn’t handle concurrency as well as vLLM.
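
As a rough illustration of what a draft probability cutoff does: the sketch below assumes the flag behaves like llama.cpp's existing draft p-min for draft models, where drafting stops at the first token whose probability falls below the threshold. Whether the MTP PR applies exactly this rule is an assumption, and the helper name and the probabilities in the example are made up.

```shell
#!/usr/bin/env bash
# Count how many draft tokens survive a confidence cutoff.
# Usage: kept_draft_tokens <p_min> <n_max> <prob1> <prob2> ...
# Assumed rule: drafting stops at the first token whose probability is
# below p_min, or once n_max tokens have been drafted.
kept_draft_tokens() {
  awk -v p_min="$1" -v n_max="$2" 'BEGIN {
    kept = 0
    # ARGV[1] and ARGV[2] are p_min and n_max; probabilities start at ARGV[3].
    for (i = 3; i < ARGC; i++) {
      if (kept >= n_max || ARGV[i] + 0 < p_min + 0) break
      kept++
    }
    print kept
  }' "$@"
}

# Hypothetical per-token draft probabilities for one drafting round.
kept_draft_tokens 0.75 5 0.98 0.91 0.80 0.62 0.55
```

With a cutoff like this, a high n-max only costs extra work when the draft head is confident, which would explain why larger draft values become workable.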

I’ve got my llama.cpp set to parallel=1 so requests queue up. I haven’t been able to get vLLM running, and the 30 t/s (60 t/s with MTP) is keeping me happy for the time being.

About to try out ngram+MTP to see if that makes things even better. I need to look into that --spec-draft-p-min option.

Here are some results without MTP. MTP roughly doubles generation speed for a single request (c1), but at the expense of slowing down concurrent requests: at c4, total generation throughput is lower with MTP than without.

                                        llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      1,084 │       13.1 │       1,834 │     11,381 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │        908 │       24.1 │       3,984 │     14,401 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │        864 │       41.5 │       8,080 │     20,190 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │        964 │       12.9 │       5,567 │     15,269 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        919 │       23.9 │      11,387 │     21,874 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        823 │       39.7 │      25,396 │     38,068 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │        941 │       12.7 │       9,542 │     19,401 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        898 │       23.0 │      19,380 │     30,263 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        704 │       38.3 │      48,722 │     61,864 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴────────────┴────────────┘
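
To put numbers on the comparison, the tg t/s values from the two tables can be divided directly. This is a small sketch using only the depth-0 rows; the helper name speedup is made up for illustration.

```shell
#!/usr/bin/env bash
# Ratio of MTP to baseline tg t/s, two decimals (awk handles the floats).
speedup() {
  awk -v mtp="$1" -v base="$2" 'BEGIN { printf "%.2f", mtp / base }'
}

# tg t/s copied from the two tables at depth 0:
# "<concurrency> <with MTP> <without MTP>"
while read -r c mtp base; do
  echo "$c: $(speedup "$mtp" "$base")x"
done <<'EOF'
c1 28.3 13.1
c2 25.9 24.1
c4 29.9 41.5
EOF
```

A ratio above 1 means MTP was faster for that row; below 1 means the baseline was faster.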

I found the following blog post very interesting: it applies not only MTP but also TurboQuant patches. It lets me run both qwen3.6 and gemma4 simultaneously with max 256k context length.

MTP Speculative Decoding Actually Works on MoE: 144 t/s on a 16GB GPU

First I tried vLLM, but I have up to 8 simultaneous users, which makes it hard to run two models at max context on a single Spark given the RAM allocation. That’s why I prefer llama.cpp, and the mtp+turboquant version made me incredibly happy.

Two interesting findings while using llama.cpp:

1- The TC-33 Hallucination Resistance test fails on Gemma4-26B-A4B-it-UD-Q8_K_XL with ctk and ctv set to q8_0, but passes with the turbo4 quant. I would have expected the opposite result.

2- I used the chat template from the vLLM tests and found “fixed-chat-template-v5.jinja” stable on vLLM. However, the short test in tool-eval-bench gives 2 partial passes with it, whereas all the tests pass when I don’t change the chat template on the MTP version of Qwen3.6-35B-A3B-UD-Q8_K_XL.

The perf-only benchmark on qwen3.6-35b using both MTP and turbo4:

Throughput Benchmark — qwen3.6-35b

  • Run ID: 2026-05-16T17-43-15Z_96bd71
  • Date: 2026-05-16T17:43:15.453891+00:00
  • Mode: throughput-only
  • tool-eval-bench: v1.6.0

Run Context

Parameter     Value
Backend       vllm
Server        http://***:30000
Model (API)   qwen3.6-35b
Temperature   0.0
Seed
Max Turns     8
Timeout       60.0s
Scenarios     all (69)
Parallel      1 (sequential)
Error Rate    0.0
Thinking      enabled

Environment

Property   Value
Host       gx10-53b6
Platform   Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39
Python     3.12.3

Results

Test                       pp t/s   tg t/s   TTFT (ms)   Total (ms)     Tokens
pp2048 tg128 @ d0           1,386     60.4       1,380        3,424   2048+128
pp2048 tg128 @ d0 c2        1,013     72.7       3,779        7,061   2048+128
pp2048 tg128 @ d0 c4        1,056     77.3       7,078       13,227   2048+128
pp2048 tg128 @ d4096        1,471     65.1       3,889        5,780   2048+128
pp2048 tg128 @ d4096 c2     1,433     69.2       7,655       11,152   2048+128
pp2048 tg128 @ d4096 c4     1,242     44.9      16,158       24,125   2048+128
pp2048 tg128 @ d8192        1,395     62.2       6,827        8,807   2048+128
pp2048 tg128 @ d8192 c2     1,432     61.2      12,738       16,420   2048+128
pp2048 tg128 @ d8192 c4     1,307     25.7      23,054       33,139   2048+128

Can you share your recipes for running those two simultaneously on one GX10?

sure.

btw, I have downloaded unsloth gguf UD-Q8_K_XL model files (qwen3.6 is the MTP version) using hf download to my ~/models folder beforehand.

here they are:

./mtp-turboquant/build/bin/llama-server \
  -m ~/models/MTP/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --alias "qwen3.6-35b" \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-mmap \
  --threads 8 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  -c 262144 \
  --host 0.0.0.0 --port 30000 \
  -ctk turbo4 -ctv turbo4 \
  --jinja \
  --spec-type draft-mtp --spec-draft-n-max 3 \

./mtp-turboquant/build/bin/llama-server \
  -m ~/models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  --alias "gemma4-26b" \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-mmap \
  --threads 8 \
  -c 262144 \
  --host 0.0.0.0 --port 30001 \
  --jinja \
  -ctk turbo4 -ctv turbo4 \

there is an empty line at the end of the script and I sometimes add or remove some additional options, that is why I have “\” at the end.

and llama.cpp is compiled using somewhat standard cmake options: (clone and cd into the directory first)

cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DGGML_NATIVE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 20 --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

Thanks for that. By the way, I’m curious: is there a reason behind this combo of Qwen3.6 and Gemma? Why not Qwen3.6 MoE (35B) + Qwen3.6 dense (27B)?

I find the dense one very slow on GB10, but I am thinking about using an orchestrator to switch the second model between gemma4 dense, qwen3.6 dense and gemma4 moe. qwen3.6 moe serves very well for coding purposes.

I can’t run your example… I get:

error while handling argument "-ctk": Unsupported cache type: turbo4

usage:
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)


to show complete usage, run with -h

Are you using a special branch/fork? Is it this one: GitHub - AtomicBot-ai/atomic-llama-cpp-turboquant: llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput). · GitHub

In the blog post they linked, it lists this one: GitHub - NJannasch/llama.cpp at mtp-turboquant · GitHub

Many thanks. I was struggling with the other fork… It was working with a slightly different syntax… Nonetheless, I will rebuild with this other fork for testing.

Update… that branch does not support turbo4… so I guess I will stick with the Atomic-ai build… until @gokhan.moral provides the actual fork/branch he used.

UPDATE

Doh… I built from master… not the mtp-turboquant branch… my bad. Solution for build:

git clone https://github.com/NJannasch/llama.cpp.git mtp-turboquant
cd mtp-turboquant
git checkout mtp-turboquant

cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DGGML_NATIVE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j 20 --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

Model download:

HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --local-dir <path>/models
HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf --local-dir <path>/models

I tried out Qwen3.6-35B-A3B Q5_K_M with MTP (not with turboquant), and 5-6 draft tokens does feel too high, but I’m thinking the draft min-p needs to be tuned specifically for this model. The acceptance rate is sometimes very low, and never above 70% that I’ve seen. Lowering the number of draft tokens doesn’t seem to help. Regardless, here’s what I’ve been getting with mostly the same config, but with ubatch lowered to 2048:

                                                     llama-benchy Results                                                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Test                          ┃    c    ┃        pp t/s ┃        tg t/s ┃       TTFT (ms) ┃     Total (ms) ┃         Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0             │   c1    │         2,069 │          64.4 │             922 │          2,834 │       2048+128 │
│ pp2048 tg128 @ d0             │   c2    │         1,306 │          64.6 │           2,952 │          6,545 │       2048+128 │
│ pp2048 tg128 @ d0             │   c4    │         1,307 │          71.6 │           5,819 │         12,477 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c1    │         2,049 │          60.1 │           2,619 │          4,674 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c2    │         2,016 │          63.8 │           5,152 │          8,884 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c4    │         1,716 │          70.3 │          12,571 │         19,222 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c1    │         2,038 │          65.8 │           4,340 │          6,209 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c2    │         1,411 │          62.0 │          13,198 │         16,949 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c4    │         1,445 │          69.7 │          24,973 │         32,007 │       2048+128 │
└───────────────────────────────┴─────────┴───────────────┴───────────────┴─────────────────┴────────────────┴────────────────┘
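
The acceptance rate ties directly to how much a longer draft can help. Under the simplifying assumption that each draft token is accepted independently with the same probability a, a verification pass with n draft tokens yields (1 - a^(n+1)) / (1 - a) tokens on average, counting the one token the main model always contributes. Real acceptance is not independent per token, so treat this as a rough estimate; the helper name is made up.

```shell
#!/usr/bin/env bash
# Expected tokens produced per verification pass.
# Usage: expected_tokens <acceptance_rate> <n_draft>
# Assumes independent, identical per-token acceptance (a simplification).
expected_tokens() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    # With perfect acceptance every draft token plus the bonus token is kept.
    if (a >= 1) { printf "%.2f", n + 1; exit }
    printf "%.2f", (1 - a ^ (n + 1)) / (1 - a)
  }'
}

# 0.70 is the best acceptance rate mentioned above.
for n in 2 3 5; do
  echo "n=$n: $(expected_tokens 0.70 "$n") tokens per pass"
done
```

At a = 0.70 the estimate grows only from about 2.53 tokens per pass at n=3 to about 2.94 at n=5, so the extra draft tokens buy little unless acceptance improves.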

For comparison, this one beats it easily without MTP, so MTP on the MoE still needs work: sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF · Hugging Face

                                        llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      2,714 │       68.1 │         735 │      2,532 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │      2,386 │       94.6 │       1,434 │      4,056 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │      2,324 │      127.0 │       3,054 │      6,999 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │      2,539 │       67.0 │       2,139 │      3,966 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │      2,481 │       93.5 │       4,170 │      6,826 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │      2,374 │      121.7 │       8,689 │     12,808 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │      2,517 │       62.7 │       3,481 │      5,439 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │      2,476 │       89.6 │       6,959 │      9,734 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │      2,318 │      115.9 │      14,827 │     19,160 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴────────────┴────────────┘

Has anyone had issues running llama.cpp for long-lived serving workloads?

I tried it, and I’m happy with the PP and TG speeds. However, after about 20 minutes of serving with llama.cpp, the server suddenly stopped.

Compared to vLLM, which has been running 24/7 without issues, llama.cpp serving seems less reliable in my setup. I followed the instructions from @blainesworld at the top of this thread.
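
Until the cause of a stop like this is found, one common stopgap is to run the server under a small restart wrapper. The sketch below is a generic shell helper, not something provided by llama.cpp; the name run_with_restart and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Re-run a command each time it fails, up to max_restarts extra attempts.
# Usage: run_with_restart <max_restarts> <delay_seconds> <command...>
run_with_restart() {
  local max_restarts=$1 delay=$2
  shift 2
  local attempt=0 status
  while true; do
    status=0
    "$@" || status=$?
    # A clean exit (status 0) is treated as an intentional shutdown.
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_restarts" ]; then
      echo "giving up after $attempt restarts (last status $status)" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    echo "exited with status $status, restart $attempt/$max_restarts in ${delay}s" >&2
    sleep "$delay"
  done
}
```

It would be called as run_with_restart 5 10 ./build/bin/llama-server followed by the usual server flags. This only hides the failure, so keeping the server log from each run is still worthwhile for a bug report.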

I started with the Q8 quantized model but am now using FP16. I’ve tested it with this option all day and everything seems fine. My llama-server is built from the llama.cpp #22673 branch.

./llama-server -m /Models/Qwen3.6-27B-F16-mtp.gguf \
  --spec-type mtp \
  --spec-draft-n-max 3 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
Hi, thanks for the detailed run command.
I just moved from 4-bit to 8-bit, so I’m curious about FP16: what differences do you see, and why did you move up from 8-bit? Are the outputs better for your use case?
I do not want to join the tokens/s race, as I am now looking for more usable local inference on my DGX.
I am quite happy with Qwen3.6-MoE, and the 27B seemed too slow for me to try further, but if the quality is better I can live with lower t/s. I found that pp speed actually matters more, as input is usually far larger than output in my text-based and coding use cases.

Input warmly welcome.
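
The pp-versus-tg point can be checked with simple arithmetic: request time is roughly prompt tokens divided by pp speed plus generated tokens divided by tg speed. The sketch below uses a hypothetical request of 20,000 prompt tokens and 500 generated tokens with the c1 depth-0 rates from the first table in this thread (719 pp t/s, 28.3 tg t/s). It ignores the slowdown at larger context depths, and the helper name is made up.

```shell
#!/usr/bin/env bash
# Split a request into prefill and decode time.
# Usage: request_seconds <prompt_tokens> <generated_tokens> <pp_tps> <tg_tps>
request_seconds() {
  awk -v prompt="$1" -v gen="$2" -v pp="$3" -v tg="$4" 'BEGIN {
    printf "prefill %.1fs decode %.1fs\n", prompt / pp, gen / tg
  }'
}

# Hypothetical workload; the rates are taken from the first table (c1, d0).
request_seconds 20000 500 719 28.3
```

For a long prompt with a short reply, prefill takes longer than decode at these rates, which matches the observation that pp speed dominates for input-heavy work.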

Hello. The reason I switched from the q8 model to the fp16 model was due to the nature of coding. As you mentioned, coding requires precise, line-by-line accuracy, which is different from conversational chatting. Since I am currently working on a personal project using “vibe coding,” it was also important for the model to accurately grasp my intent. While q8 provides good quality, I noticed a difference compared to fp16 after using it for an extended period.