There’s growing excitement around MTP with llama.cpp (this PR): "llama + spec: MTP Support" by am17an (Pull Request #22673, ggml-org/llama.cpp). I decided to give it a try on my Spark, using the Q4_K_M quant of Unsloth’s MTP version: unsloth/Qwen3.6-27B-MTP-GGUF. I tried …

How much faster is that than with MTP disabled?

That draft value seems pretty high; I’ve only seen 2 or 3 recommended. Do you find the dense model better for coding than the MoE? Are you using it for coding or something else? I’m super pleased with qwen36moe@8bit on my Spark.

@coder543 Unfortunately, I’m now getting persistent crashes trying to get the baseline numbers. I’m having the same issue that was reported on the PR, but it was working for a while.

@verdverm Yes, 2-3 has seemed generally recommended for Qwen3.6, but I think the key is the new ability to specify mi…
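For anyone weighing a draft value of 2-3 against something higher, here is a rough back-of-the-envelope sketch. It is not taken from the PR or from llama.cpp: it assumes every drafted token is accepted independently with the same probability (real acceptance rates vary by position and by prompt), and it ignores the cost of producing the draft. The function name and the acceptance rates in the loop are made up for illustration.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the target model.

    Simplifying assumptions (not measured on any real model):
    - each drafted token is accepted independently with probability accept_prob
    - drafting stops at the first rejection
    - every verification pass yields one extra token (correction or bonus)
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only to show the trend.
    for p in (0.6, 0.8, 0.9):
        row = ", ".join(
            f"draft={n}: {expected_tokens_per_step(p, n):.2f}"
            for n in (1, 2, 3, 5, 8)
        )
        print(f"accept_prob={p}: {row}")
```

Under these assumptions, longer drafts only pay off when the acceptance rate is high, which may be why a minimum-probability cutoff matters more as the draft value goes up.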

I’ve got llama.cpp set to parallel=1 so requests queue up. I haven’t been able to get vLLM running, and the 30 t/s (60 t/s with MTP) is keeping me happy for the time being. About to try out ngram+MTP to see if that makes things even better. Need to look into that --draft-min-p.
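To put the 30 t/s vs 60 t/s figures in terms of waiting time when requests queue up one at a time, a minimal sketch follows. It only models generation time: prompt processing is ignored, the 600-token response length is a made-up example, and the function name is hypothetical.

```python
def completion_times(tokens_per_request, tokens_per_second):
    """Finish time in seconds for each request when served strictly one at a time.

    Assumes a constant generation speed and ignores prompt processing,
    so this is a lower bound on what a real queue would show.
    """
    elapsed = 0.0
    finished = []
    for n_tokens in tokens_per_request:
        elapsed += n_tokens / tokens_per_second
        finished.append(elapsed)
    return finished


if __name__ == "__main__":
    queue = [600, 600, 600]  # hypothetical response lengths in tokens
    print("30 t/s:", completion_times(queue, 30))
    print("60 t/s:", completion_times(queue, 60))
```

With a single slot, doubling the generation speed halves the wait for every request in the queue, not just the first one.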

Here are some results without MTP. Adding MTP really helps when there is no concurrency, at the expense of slowing down concurrent requests: llama-benchy Results …

I found the following blog post very interesting: it applies not only MTP but also turboquant patches. It lets me run both qwen3.6 and gemma4 simultaneously with a max 256k context length. "MTP Speculative Decoding Actually Works on MoE: 144 t/s on a 16GB GPU". First I tried vLLM, but I have up to 8 si…

Can you share your recipes for running those two simultaneously on one GX10?

Sure. BTW, I downloaded the unsloth GGUF UD-Q8_K_XL model files (qwen3.6 is the MTP version) using hf download to my ~/models folder beforehand. Here they are:

    ./mtp-turboquant/build/bin/llama-server \
      -m ~/models/MTP/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
      --alias "qwen3.6-35b" \
      --n-gpu-layers…

Thx for that. BTW, I’m curious: is there a reason behind this combo of Qwen3.6 and Gemma? Why not Qwen3.6 MoE (35B) + Qwen3.6 dense (27B)?

MTP+llama.cpp: a look at Qwen3.6-27B

gokhan.moral May 16, 2026, 7:54pm 11

I find the dense one very slow on GB10, but I am thinking about using an orchestrator to switch the second one between gemma4 dense, qwen3.6 dense and gemma4 MoE. qwen3.6 MoE serves very well for coding purposes.
