My DGX Spark Setup (unsloth qwen36moe 2x, llama-cpp+mtp PR, ansible for easy mode)

Thought I’d share my setup from onboarding my DGX Spark, since I got so much help from you all while lurking on this forum.

tl;dr

  • unsloth/qwen36moe - 8bit quant, MTP PR included
  • ansible to clone/compile/run llama-server
  • a few extras (like scion, highly recommend)

Which PR are you using? I’m always down for trying new settings that might give me some more performance/quality :)

I totally forgot to include the link to the github repo… :facepalm: and seems I cannot edit my original… :frustrating:

I’m using the PR that the Unsloth quants reference. All the details should be in the README.

Looking at their code, it’s this one: llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp · GitHub

I’m currently running some performance tests on that same PR with Unsloth’s MTP version of Qwen3.6-27B:Q4_K_M

@verdverm I noticed you’re not using Unsloth’s MTP versions here: sparky/util/llama-server.sh at main · verdverm/sparky · GitHub

That is what they suggest: Qwen3.6 - How to Run Locally | Unsloth Documentation

Check out the Ansible setup, that’s the latest. It supports running with or without MTP, and should be general enough for any model/quant.

My apologies. I see it now, yes very configurable!

the 30t/s (60t/s with MTP) is keeping me happy for the time being

Based off what you said in the other thread ^^, I think you could squeeze more out of it if you want to be happier! I did lots of benchmarking with various llama-server options. The most impactful are the batch sizes. With the same config shared in my other thread, here’s Unsloth’s MXFP4 quant of Qwen3.5-35B-A3B without MTP:

                                       llama-benchy Results                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Test                   ┃   c   ┃     pp t/s ┃     tg t/s ┃  TTFT (ms) ┃ Total (ms) ┃     Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0      │  c1   │      2,773 │       62.2 │        732 │      2,703 │   2048+128 │
│ pp2048 tg128 @ d0      │  c2   │      2,414 │       90.4 │      1,464 │      4,210 │   2048+128 │
│ pp2048 tg128 @ d0      │  c4   │      2,360 │      124.2 │      2,972 │      7,005 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c1   │      2,576 │       60.1 │      2,081 │      4,124 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c2   │      2,557 │       86.6 │      4,107 │      6,977 │   2048+128 │
│ pp2048 tg128 @ d4096   │  c4   │      2,508 │      114.9 │      8,408 │     12,774 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c1   │      2,589 │       56.3 │      3,425 │      5,610 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c2   │      2,538 │       81.7 │      6,767 │      9,814 │   2048+128 │
│ pp2048 tg128 @ d8192   │  c4   │      2,393 │      107.4 │     14,351 │     19,029 │   2048+128 │
└────────────────────────┴───────┴────────────┴────────────┴────────────┴────────────┴────────────┘
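
A quick sketch for reading the concurrency scaling off the table above. The values are copied from the tg t/s column; it assumes that column is aggregate throughput across all concurrent requests, which this thread does not confirm. The `scaling` and `per_request` helpers are just illustrative names.

```python
# tg t/s values copied from the table above, keyed by context depth,
# then by concurrency level (c1, c2, c4).
tg = {
    0:    {1: 62.2, 2: 90.4, 4: 124.2},
    4096: {1: 60.1, 2: 86.6, 4: 114.9},
    8192: {1: 56.3, 2: 81.7, 4: 107.4},
}


def scaling(depth: int, c: int) -> float:
    """Throughput at concurrency c relative to c1 at the same depth."""
    return tg[depth][c] / tg[depth][1]


def per_request(depth: int, c: int) -> float:
    """Per-request speed, ASSUMING tg t/s is the aggregate over c requests."""
    return tg[depth][c] / c


for depth, row in tg.items():
    for c in row:
        print(f"d{depth:<5} c{c}: {scaling(depth, c):.2f}x vs c1, "
              f"~{per_request(depth, c):.1f} t/s per request")
```

On these numbers, c4 gives roughly 1.9x to 2.0x the c1 throughput at every depth, so under that assumption each individual request runs at about half the single-stream speed.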

So with MTP added on, I think you could go higher than 60 t/s. I’m going to try it once I get past this crashing issue.
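
For anyone who wants to repeat the batch-size tuning mentioned above, here is a minimal sketch that generates llama-server command lines for a small sweep. It assumes "batch sizes" means llama-server's `--batch-size` (logical) and `--ubatch-size` (physical) options. The binary path, model path, and the specific values are hypothetical placeholders, not the config from the other thread.

```python
import itertools
import shlex

# Hypothetical placeholders: point these at your own build and model.
LLAMA_SERVER = "./build/bin/llama-server"
MODEL = "/models/model.gguf"


def sweep_commands(batch_sizes, ubatch_sizes, port=8080):
    """Return one llama-server command line per (batch, ubatch) pair.

    Pairs where the physical batch (ubatch) exceeds the logical batch
    are skipped.
    """
    cmds = []
    for b, ub in itertools.product(batch_sizes, ubatch_sizes):
        if ub > b:
            continue
        argv = [
            LLAMA_SERVER, "-m", MODEL, "--port", str(port),
            "--batch-size", str(b), "--ubatch-size", str(ub),
        ]
        cmds.append(shlex.join(argv))
    return cmds


# Arbitrary example values, chosen only to illustrate the sweep.
for cmd in sweep_commands([2048, 4096], [512, 1024, 2048, 4096]):
    print(cmd)
```

Run each printed command in turn, benchmark against it with the same pp/tg settings, and compare the resulting pp t/s and tg t/s rows.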