Thought I'd share my setup from onboarding to my DGX Spark, since I got so much help from you all while lurking on this forum.
tl;dr
unsloth/qwen36moe - 8bit quant, MTP PR included
ansible to clone/compile/run llama-server
a few extras (like scion, highly recommend)
Which PR are you using? I'm always down for trying new settings that might give me some more performance/quality :)
I totally forgot to include the link to the GitHub repo… :facepalm: and it seems I cannot edit my original… :frustrating:
My DGX setup
I'm using the PR referenced on the Unsloth quants to go with the work; all the details should be in the README.
Looking at their code, it's this one: llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp · GitHub
I'm currently running some performance tests on that same PR with Unsloth's MTP version of Qwen3.6-27B:Q4_K_M.
Check out the Ansible, that's the latest. It supports running with or without MTP, and should be general enough for any model/quant.
My apologies. I see it now, yes very configurable!
the 30t/s (60t/s with MTP) is keeping me happy for the time being
Based off what you said in the other thread ^^, I think you could squeeze more out of it if you want to be happier! I did lots of benchmarking with various llama-server options. The most impactful are the batch sizes. With the same config shared in my other thread, here's Unsloth's MXFP4 quant of Qwen3.5-35B-A3B without MTP:
llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Test                 ┃ c  ┃ pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃ Tokens   ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ pp2048 tg128 @ d0    │ c1 │  2,773 │   62.2 │       732 │      2,703 │ 2048+128 │
│ pp2048 tg128 @ d0    │ c2 │  2,414 │   90.4 │     1,464 │      4,210 │ 2048+128 │
│ pp2048 tg128 @ d0    │ c4 │  2,360 │  124.2 │     2,972 │      7,005 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c1 │  2,576 │   60.1 │     2,081 │      4,124 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c2 │  2,557 │   86.6 │     4,107 │      6,977 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c4 │  2,508 │  114.9 │     8,408 │     12,774 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c1 │  2,589 │   56.3 │     3,425 │      5,610 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c2 │  2,538 │   81.7 │     6,767 │      9,814 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c4 │  2,393 │  107.4 │    14,351 │     19,029 │ 2048+128 │
└──────────────────────┴────┴────────┴────────┴───────────┴────────────┴──────────┘
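To make the concurrency numbers easier to read, here is a small sketch that turns the tg t/s column into a per-stream rate and a speedup relative to c1. It assumes the tg t/s figure is the aggregate across all concurrent requests (I have not confirmed that against llama-benchy's docs), and the `scaling` helper is just an illustrative name.

```python
# Decode throughput (tg t/s) copied from the table above, keyed by (depth, concurrency).
tg = {
    (0, 1): 62.2, (0, 2): 90.4, (0, 4): 124.2,
    (4096, 1): 60.1, (4096, 2): 86.6, (4096, 4): 114.9,
    (8192, 1): 56.3, (8192, 2): 81.7, (8192, 4): 107.4,
}


def scaling(tg_rates):
    """Return {(depth, c): (per_stream_tps, speedup_vs_c1)}.

    Assumption: each rate is the aggregate over all c concurrent requests.
    """
    out = {}
    for (depth, c), rate in tg_rates.items():
        base = tg_rates[(depth, 1)]  # single-request rate at the same depth
        out[(depth, c)] = (rate / c, rate / base)
    return out


for (depth, c), (per_stream, speedup) in sorted(scaling(tg).items()):
    print(f"d{depth:<5} c{c}: {per_stream:5.1f} t/s per stream, {speedup:.2f}x vs c1")
```

Under that assumption, c4 at d0 comes out to roughly 31 t/s per request, so each individual request decodes at about half the c1 rate while total throughput is about double.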
So with MTP added on, I think you could go higher than 60 t/s. I'm going to try it once I get past this crashing issue.
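For anyone wanting to repeat the batch-size tuning mentioned above, here is a rough sketch that generates llama-server command lines for a sweep. `--batch-size` and `--ubatch-size` are the real llama-server flags for the logical and physical batch size; the model path and the candidate values are placeholders, not the settings behind the table, and `llama_server_cmd` / `sweep` are made-up helper names.

```python
from itertools import product


def llama_server_cmd(model_path, batch, ubatch, extra=()):
    """Build an argv list for llama-server with the given batch sizes."""
    return [
        "llama-server",
        "-m", model_path,
        "--batch-size", str(batch),    # logical batch size
        "--ubatch-size", str(ubatch),  # physical (micro) batch size
        *extra,
    ]


def sweep(model_path, batches, ubatches):
    """Return one command per (batch, ubatch) pair, skipping ubatch > batch.

    Assumption: a physical batch larger than the logical batch brings no
    benefit, so those combinations are not worth benchmarking.
    """
    return [
        llama_server_cmd(model_path, b, ub)
        for b, ub in product(batches, ubatches)
        if ub <= b
    ]


# Placeholder path and values, purely for illustration.
for cmd in sweep("/models/model.gguf", [2048, 4096], [512, 2048]):
    print(" ".join(cmd))
```

Each printed line can be started in turn and benchmarked with the same llama-benchy settings, so the only thing changing between runs is the batch configuration.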