MTP+llama.cpp: a look at Qwen3.6-27B

blainesworld · May 15, 2026, 10:55pm

There’s some growing excitement around MTP with llama.cpp (this PR): llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp · GitHub

I decided to give it a try on my Spark, using the Q4_K_M quant of Unsloth’s MTP version: unsloth/Qwen3.6-27B-MTP-GGUF · Hugging Face

I tried 4, 5, and 6 draft tokens. The results for 5 are below-- it seemed to give the best performance across concurrencies 1-4

                                      llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃  TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │        719 │       28.3 │      2,723 │      7,031 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │        614 │       25.9 │      5,777 │     14,849 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │        643 │       29.9 │     11,376 │     26,996 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │        660 │       27.6 │      8,161 │     12,588 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        604 │       31.6 │     17,405 │     25,147 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        569 │       31.1 │     36,838 │     52,118 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │        635 │       29.0 │     13,838 │     18,039 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        578 │       26.5 │     30,210 │     39,216 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        432 │       26.9 │     80,560 │     98,285 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Build config:

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CPU_KLEIDIAI=ON

Command:

export CUDA_SCALE_LAUNCH_QUEUES=4x
export GGML_CUDA_GRAPH_OPT=1
export GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F=1

./build/bin/llama-server \
--hf-repo unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--host 0.0.0.0 \
--port 8000 \
--alias qwen3.6-27b \
--parallel 8 \
--threads 10 \
--threads-batch 10 \
--threads-http 8 \
--prio 3 \
--poll 100 \
--direct-io \
--metrics \
--ctx-size 262144 \
--batch-size 32768 \
--ubatch-size 4096 \
--cache-ram 65536 \
--kv-unified \
-ctk f16 -ctv f16 \
--cache-reuse 1024 \
--ctx-checkpoints 128 \
--no-mmap \
--mlock \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.05 \
--repeat-penalty 1.05 \
--presence-penalty 0.00 \
--flash-attn on \
--chat-template-kwargs '{"preserve_thinking": true}' \
--chat-template-file ./froggeric.jinja \
--spec-type draft-mtp \
--spec-draft-n-max 5 \
--spec-draft-p-min 0.75

coder543 · May 15, 2026, 11:00pm

How much faster is that than with MTP disabled?

verdverm · May 15, 2026, 11:16pm

That draft value seems pretty high, I’ve only seen 2 or 3 recommended.

Do you find the dense model better for coding than the MoE? Are you using it for coding or something else? I’m super pleased with qwen36moe@8bit on my spark

blainesworld · May 15, 2026, 11:29pm

@coder543 Unfortunately, I’m now getting persistent crashes trying to get the baseline numbers. I’m having the same issue that was reported on the PR, but it was working for a while.

@verdverm Yes 2-3 has seemed generally recommended for Qwen3.6, but I think the key is the new ability to specify min-p for the draft (--spec-draft-p-min 0.75) which boosts the acceptance at the higher draft tokens up to 7. I use Qwen for coding and I do prefer the MoE, but the MTP benefit for the dense model does seem appealing. I think 30 tk/s is usable, but llama.cpp doesn’t handle concurrency as well as vLLM.

verdverm · May 15, 2026, 11:32pm

I’ve got my llama-cpp set to parallel=1 so requests queue up, I haven’t been able to get vLLM running and the 30t/s (60t/s with MTP) is keeping me happy for the time being.

About to try out the ngram+mtp to see if that makes things even better. Need to look into that --draft-min-p

blainesworld · May 16, 2026, 4:03pm

Here are some results without MTP. The addition of MTP really helps the lack of concurrency at the expense of slowing down concurrent requests:

                                        llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      1,084 │       13.1 │       1,834 │     11,381 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │        908 │       24.1 │       3,984 │     14,401 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │        864 │       41.5 │       8,080 │     20,190 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │        964 │       12.9 │       5,567 │     15,269 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │        919 │       23.9 │      11,387 │     21,874 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │        823 │       39.7 │      25,396 │     38,068 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │        941 │       12.7 │       9,542 │     19,401 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │        898 │       23.0 │      19,380 │     30,263 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │        704 │       38.3 │      48,722 │     61,864 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴────────────┴────────────┘

gokhan.moral · May 16, 2026, 5:44pm

I found the following blog post very interesting, not only mtp but also turboquant patches are applied. It lets me run both qwen3.6 and gemma4 simultaneously with max 256k context length.

MTP Speculative Decoding Actually Works on MoE: 144 t/s on a 16GB GPU

First I tried vllm, but I have upto 8 simultaneous users which makes it hard to run two models at max ctx on a single spark with ram allocation. That’s why I prefer llama.cpp, and mtp+turboquant version made me incredibly happy.

two interesting findings while using llama.cpp:

1- TC-33 Hallucination Resistance test fails on Gemma4-26B-A4B-it-UD-Q8_K_XL using ctk and ctv q8_0 but passes with turbo4 quant. I would expect an opposite result.

2- I used the chat-template from vllm tests and found the “fixed-chat-template-v5.jinja” stable on vllm. however, the short test in tool-eval-bench gives 2 partial passes whereas all the tests pass when I don’t change the chat-template on the MTP version of Qwen3.6-35B-A3B-UD-Q8_K_XL.

the perf-only benchmark on qwen3.6-35b using both mtp and turbo4:

Throughput Benchmark — qwen3.6-35b

Run ID: 2026-05-16T17-43-15Z_96bd71
Date: 2026-05-16T17:43:15.453891+00:00
Mode: throughput-only
tool-eval-bench: v1.6.0

Run Context

Parameter	Value
Backend	vllm
Server	`http://***:30000`
Model (API)	`qwen3.6-35b`
Temperature	0.0
Seed	—
Max Turns	8
Timeout	60.0s
Scenarios	all (69)
Parallel	1 (sequential)
Error Rate	0.0
Thinking	enabled

Environment

Property	Value
Host	`gx10-53b6`
Platform	`Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39`
Python	3.12.3

Results

Test	pp t/s	tg t/s	TTFT (ms)	Total (ms)	Tokens
pp2048 tg128 @ d0	1,386	60.4	1,380	3,424	2048+128
pp2048 tg128 @ d0 c2	1,013	72.7	3,779	7,061	2048+128
pp2048 tg128 @ d0 c4	1,056	77.3	7,078	13,227	2048+128
pp2048 tg128 @ d4096	1,471	65.1	3,889	5,780	2048+128
pp2048 tg128 @ d4096 c2	1,433	69.2	7,655	11,152	2048+128
pp2048 tg128 @ d4096 c4	1,242	44.9	16,158	24,125	2048+128
pp2048 tg128 @ d8192	1,395	62.2	6,827	8,807	2048+128
pp2048 tg128 @ d8192 c2	1,432	61.2	12,738	16,420	2048+128
pp2048 tg128 @ d8192 c4	1,307	25.7	23,054	33,139	2048+128

kafej666 · May 16, 2026, 6:21pm

Can you share your reipces for those two simultaneously in one GX10?

gokhan.moral · May 16, 2026, 7:13pm

sure.

btw, I have downloaded unsloth gguf UD-Q8_K_XL model files (qwen3.6 is the MTP version) using hf download to my ~/models folder beforehand.

here they are:

./mtp-turboquant/build/bin/llama-server \
  -m ~/models/MTP/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --alias "qwen3.6-35b" \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-mmap \
  --threads 8 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  -c 262144 \
  --host 0.0.0.0 --port 30000 \
  -ctk turbo4 -ctv turbo4 \
  --jinja \
  --spec-type draft-mtp --spec-draft-n-max 3 \

./mtp-turboquant/build/bin/llama-server \
  -m ~/models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  --alias "gemma4-26b" \
  --n-gpu-layers 999 \
  --flash-attn on \
  --no-mmap \
  --threads 8 \
  -c 262144 \
  --host 0.0.0.0 --port 30001 \
  --jinja \
  -ctk turbo4 -ctv turbo4 \

there is an empty line at the end of the script and I sometimes add or remove some additional options, that is why I have “\” at the end.

and llama.cpp is compiled using somewhat standard cmake options: (clone and cd into the directory first)

cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DGGML_NATIVE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 20 --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

kafej666 · May 16, 2026, 7:41pm

Thx for that, btw I’m curious but there is a reason behind this combo? Qwen3.6 and Gemma? why not Qwen3.6MOE (35) + Qwen3.6 (26) dense?

gokhan.moral · May 16, 2026, 7:54pm

I find the dense one very slow on gb10. but I am thinking about using an orchestrator to switch the second one between gemma4 dense, qwen3.6 dense and gemma4 moe. qwen3.6 moe serves very well for coding purposes.

bernardlbmi3 · May 16, 2026, 10:26pm

I can’t run your example… I get:

error while handling argument "-ctk": Unsupported cache type: turbo4

usage:
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)


to show complete usage, run with -h

Are you using a special branch/fork? Is it this one: GitHub - AtomicBot-ai/atomic-llama-cpp-turboquant: llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput). · GitHub

blainesworld · May 16, 2026, 10:42pm

In the blog post they linked, it lists this one: GitHub - NJannasch/llama.cpp at mtp-turboquant · GitHub

bernardlbmi3 · May 16, 2026, 10:57pm

Many thanks. I was struggling with the other fork… It was working with a slightly different syntax… none the less… I will rebuild with this other fork for testing.

bernardlbmi3 · May 16, 2026, 11:29pm

Update… that branch does not support turbo4… so I guess I will stick with the Atomic-ai build… until @gokhan.moral provide the actual fork/branch he used.

UPDATE

Doh… I built from master… not the mtp-turboquant branch… my bad. Solution for build:

git clone https://github.com/NJannasch/llama.cpp.git mtp-turboquant
cd mtp-turboquant
git checkout mtp-turboquant

cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_CURL=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DGGML_NATIVE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j 20 --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

Model download:

HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --local-dir <path>/models
HF_HUB_ENABLE_HF_TRANSFER=1 hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf --local-dir <path>/models

blainesworld · May 17, 2026, 12:50am

I tried out Qwen3.6-35B-A3B Q5_K_M with MTP (not with turboquant), and 5-6 tokens does feel too high, but I’m thinking the draft min-p needs tuned specifically. The acceptance rate is sometimes very low, and never above 70% that I’ve seen. With lower draft tokens, it doesn’t seem to help. Regardless, here’s what I’ve been getting with mostly the same config, but a lowered ubatch to 2048:

                                                     llama-benchy Results                                                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Test                          ┃    c    ┃        pp t/s ┃        tg t/s ┃       TTFT (ms) ┃     Total (ms) ┃         Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0             │   c1    │         2,069 │          64.4 │             922 │          2,834 │       2048+128 │
│ pp2048 tg128 @ d0             │   c2    │         1,306 │          64.6 │           2,952 │          6,545 │       2048+128 │
│ pp2048 tg128 @ d0             │   c4    │         1,307 │          71.6 │           5,819 │         12,477 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c1    │         2,049 │          60.1 │           2,619 │          4,674 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c2    │         2,016 │          63.8 │           5,152 │          8,884 │       2048+128 │
│ pp2048 tg128 @ d4096          │   c4    │         1,716 │          70.3 │          12,571 │         19,222 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c1    │         2,038 │          65.8 │           4,340 │          6,209 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c2    │         1,411 │          62.0 │          13,198 │         16,949 │       2048+128 │
│ pp2048 tg128 @ d8192          │   c4    │         1,445 │          69.7 │          24,973 │         32,007 │       2048+128 │
└───────────────────────────────┴─────────┴───────────────┴───────────────┴─────────────────┴────────────────┴────────────────┘

For comparison, without MTP, this one beats it easily, so it needs work sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF · Hugging Face

                                        llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃   TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      2,714 │       68.1 │         735 │      2,532 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │      2,386 │       94.6 │       1,434 │      4,056 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │      2,324 │      127.0 │       3,054 │      6,999 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │      2,539 │       67.0 │       2,139 │      3,966 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │      2,481 │       93.5 │       4,170 │      6,826 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │      2,374 │      121.7 │       8,689 │     12,808 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │      2,517 │       62.7 │       3,481 │      5,439 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │      2,476 │       89.6 │       6,959 │      9,734 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │      2,318 │      115.9 │      14,827 │     19,160 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴─────────────┴────────────┴────────────┘

THUNDER_SPARK · May 17, 2026, 4:27am

Has anyone had issues running llama.cpp for long-lived serving workloads?

I tried it, and I’m happy with the PP and TG speeds. However, after about 20 minutes of serving with llama.cpp, the server suddenly stopped.

Compared to vLLM, which has been running 24/7 without issues, llama.cpp serving seems less reliable in my setup. I followed the instructions from @blainesworld at the top of this thread.

Mkei88 · May 17, 2026, 5:31am

I started with the Q8 quantized model but am now using FP16. I’ve tested it with this option all day and everything seems fine. My llama-server is built from the llama.cpp #22673 branch.

./llama-server -m /Models/Qwen3.6-27B-F16-mtp.gguf
–spec-type mtp
–spec-draft-n-max 3
–cache-type-k q8_0
–cache-type-v q8_0
-np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99
–host 0.0.0.0
–port 8080

christian.gintenreiter · May 17, 2026, 9:42am

Hi - thanks for the detailed run-command
I just started using 8-bit from 4-bit so FP16 … what differences do you experience and why did you up from 8-bit. Are the outputs better for your use-case ?
I do not want to go the token/s race as i am now seeking for more usable local inference on my DGX.
I am quite happy with Qwen3.6-MoE and 27B seemed slow for me to try more but if the quality gets better I can live with lower t/s. I found out that pp-speed actually matters more as input is usually far more than output in my text-based and coding use-cases.

Input warmly welcome.

Mkei88 · May 17, 2026, 1:09pm

Hello. The reason I switched from the q8 model to the fp16 model was due to the nature of coding. As you mentioned, coding requires precise, line-by-line accuracy, which is different from conversational chatting. Since I am currently working on a personal project using “vibe coding,” it was also important for the model to accurately grasp my intent. While q8 provides good quality, I noticed a difference compared to fp16 after using it for an extended period.

Topic		Replies	Views
Moving from Mac to NVIDIA: bought powerful hardware, but drowning in configs DGX Spark / GB10 llama , nemotron	37	2923	February 25, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	434	24221	June 24, 2026
What's the best speed we can get with Qwen 3.6 27B without quantizing? DGX Spark / GB10	64	22264	July 6, 2026
Step-3.5-Flash on Single Spark with 256k context DGX Spark / GB10 Projects llama	2	854	March 3, 2026
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	309	29710	June 22, 2026
50%+ Improvement on spark?! DGX Spark / GB10 cuda , deepseek	25	2569	March 24, 2026
(sparkrun) Qwen3.5 GGUF Benchmarks over llama.cpp RPC DGX Spark / GB10 Projects llama	3	800	March 11, 2026
Llama.cpp experimental native mxfp4 support for blackwell PR DGX Spark / GB10 llama	12	1757	January 7, 2026
Compiling llama.cpp DGX Spark / GB10 llama	14	2533	February 7, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	6418	March 16, 2026

MTP+llama.cpp: a look at Qwen3.6-27B

Throughput Benchmark — qwen3.6-35b

Run Context

Environment

Results

Related topics