56 tok/s on dual Spark … tempting, but that requires a second Spark! Though for me personally, buying a second Spark just for ~70% scaling is hard to justify — especially in my country where power isn’t always guaranteed and I’m running off a power station from time to time :) Would need 4x Spark to unlock 250B+ models, and that’s a different budget entirely and not enough electricity right now. So I’ll keep squeezing every last drop out of a single Spark — maybe there’s still something left to find. Would love to hear if the community has any new ideas worth trying.
Has anyone tested EAGLE-3 instead of MTP-1 (arXiv:2503.01840)? Are there any quality issues with it?
@flash3 did some work with EAGLE: vllm-marlin-sm12x/RESULTS.md in flash7777/vllm-marlin-sm12x on GitHub.
@Albond Thank you for this thread. I noticed something that may be interesting to you:
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
This is documented in the README.md of Intel/Qwen3.5-35B-A3B-int4-AutoRound, but it is not mentioned in the README.md of Intel/Qwen3.5-122B-A10B-int4-AutoRound. Have you tried it with any success?
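For anyone copy-pasting: here's a minimal sketch of how the flag fits on a full command line, assuming the Intel repo above and an otherwise standard launch. Note that straight ASCII quotes are required; the curly quotes that forum software substitutes will break JSON parsing.

```shell
# Sanity-check the JSON before passing it to vLLM (the model name is the
# Intel repo from this thread; your usual launch flags go alongside it).
SPEC_CONFIG='{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
echo "$SPEC_CONFIG" | python3 -m json.tool > /dev/null && echo "config parses"

# vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
#   --speculative-config "$SPEC_CONFIG"
```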
Thank you for this param. Up to +20% for free! I will update the project and the main post soon with some new updates and include it in the scripts.
But I want to check which is best: EAGLE-3 vs MTP-2. I think EAGLE-3 should be better. Let's see.
DGX has more success at lower NST (e.g. NST=1), unlike RTX (NST=3): DGX is memory-bound.
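Since the best NST differs per machine, it may be worth sweeping it. A hedged sketch using run-recipe.sh from this thread; the loop only prints the configs, and the actual run line is left commented out:

```shell
# Sweep num_speculative_tokens: memory-bound boxes (DGX) tend to prefer
# low NST, compute-bound ones (RTX) higher. Uncomment the run line to benchmark.
for nst in 1 2 3; do
  cfg=$(printf '{"method":"qwen3_next_mtp","num_speculative_tokens":%d}' "$nst")
  echo "NST=$nst -> $cfg"
  # ./run-recipe.sh qwen3.5-122b-int4-autoround --solo -- --speculative-config "$cfg"
done
```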
Nice initiative, I really like the idea, going to try it out.
FYI for the initial patch call, I had to change the flags to --fp8-repo and --output from what is shown in the main README.
Here are my results (single DGX Spark, non-headless), using eugr's vLLM build and just adding the speculative-config to the command line:
./run-recipe.sh qwen3.5-122b-int4-autoround --solo -e HF_TOKEN=$HF_TOKEN -- --speculative-config '{"method":"mtp","num_speculative_tokens": 1}'
── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 7.13s = 35.9 tok/s (prompt: 23)
[Code] 502 tokens in 13.41s = 37.4 tok/s (prompt: 30)
[JSON] 1024 tokens in 27.85s = 36.7 tok/s (prompt: 48)
[Math] 64 tokens in 1.91s = 33.5 tok/s (prompt: 29)
[LongCode] 2048 tokens in 54.14s = 37.8 tok/s (prompt: 37)
./run-recipe.sh qwen3.5-122b-int4-autoround --solo -e HF_TOKEN=$HF_TOKEN -- --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens": 2}'
── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 6.52s = 39.2 tok/s (prompt: 23)
[Code] 512 tokens in 12.80s = 40.0 tok/s (prompt: 30)
[JSON] 1024 tokens in 25.83s = 39.6 tok/s (prompt: 48)
[Math] 64 tokens in 1.72s = 37.2 tok/s (prompt: 29)
[LongCode] 2048 tokens in 49.87s = 41.0 tok/s (prompt: 37)
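Comparing the two LongCode rows directly (both generate 2048 tokens), the switch from MTP-1 to qwen3_next_mtp with NST=2 works out to roughly:

```shell
# Relative gain from 37.8 tok/s (mtp, NST=1) to 41.0 tok/s (qwen3_next_mtp, NST=2)
awk -v a=37.8 -v b=41.0 'BEGIN { printf "%.1f%% faster\n", (b - a) / a * 100 }'
# prints: 8.5% faster
```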
Had similar … but stuck with EAGLE-3 to compare EAGLE-3 and MTP-2. EAGLE-3 has trouble with DeltaNet. Right now MTP-2 gives the best performance improvement.
EAGLE-3 is slower than MTP-2 on the Qwen3.5 hybrid (DeltaNet) architecture. Will update soon. Thank you for such features!
Big picture: if you want higher density, I suggest running a Qwen3.5-27B quant with MTP instead of Gemini. I prefer the FP8 version with RYS-XL, which has a few duplicated layers with all features present, for higher intelligence at ~31B total parameters. With MTP that model gives 11-12 tok/s and frankly seems smarter than Gemini 31B.
Sorry for the tangent, back to the 122B hybrid…
I've been seeing consistent issues with compact models; for me, an LLM below 100B comes with a noticeable increase in failure rates. That said, I also see progress in intelligence density. If I had the choice, I'd run models at 500B.
What do you think about the Sehyo/Qwen3.5-122B-A10B-NVFP4 model? It also supports MTP and is well quantized.
So do we think MTP would also improve Qwen397B? Has anyone looked into that?
Totally. I was able to run Qwen3.5-27B with MTP3 on an R9700, but it only helped when running in TP=2 mode, lifting FP8 quant performance to about 30 tok/s with roughly a 70% acceptance rate. That figure came from counting tokens against a wall clock, though; llama-bench style calculations were lower, in the 12-13 tok/s range, as discussed earlier here. Anecdotally, based on vibes from interacting with the model in OpenWebUI, the speed did not seem as good as what I was getting with Qwen397 running on dual Sparks. Right now I'm running 122B on dual Sparks and getting 45 tok/s, which feels pretty speedy; prompt caching is good and prompt processing is excellent. It really feels better than whatever Anthropic gives us, almost as good as Codex in fast mode. To be clear, when I say it feels good, I'm talking about response speed, not response quality.
So I think my next project will be to see if MTP helps Qwen397B. If we could bring it into the 40 tok/s range, that would be excellent.
I also gained about 8% performance by using the --no-ray option for 122B, so I had to use eugr's recipe instead of Sparkrun.
From the GitHub README, on NVFP4 quantization (16.6 tok/s, 42% slower):
RedHatAI/Qwen3.5-122B-A10B-NVFP4 sounds great on paper. In practice: SM121 doesn’t have working FP4 CUTLASS kernels in vLLM yet, so it falls back to Marlin SM80 which handles FP4 poorly. Result: 16.6 tok/s, less than half our baseline. Waiting for vLLM PRs #38957 and #31607.
I am trying to check all options for Qwen3.5 MoE, and right now MTP is the best: lossless, free tokens. But I found some other lossless ways to get more tokens, for example an INT8 LM head via Triton. Theoretically, we can get about 100 tok/s for Qwen3.5 MoE 122B-A10B.
Could you please give a link, or add instructions to the README.md, on how to compile vLLM 0.19.x for SM121?
You don't need to compile vLLM from source manually; there's a community project that handles everything: eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks). It builds a Docker image with vLLM + FlashInfer compiled for SM121. By default it downloads prebuilt wheels from GitHub releases, so the build takes ~2-3 minutes (no source compilation needed).
If you want to build from a specific vLLM version:
./build-and-copy.sh --vllm-ref v0.19.1
Or build from source (takes 20-40 min first time):
./build-and-copy.sh --rebuild-vllm --rebuild-flashinfer
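After the build, a quick smoke test can confirm the image imports vLLM. A sketch only: the image tag `spark-vllm:latest` is an assumption, so substitute whatever tag build-and-copy.sh actually produced.

```shell
# Image tag is an assumption; check `docker images` for the real one.
docker run --rm spark-vllm:latest python3 -c "import vllm; print(vllm.__version__)"
```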
Regarding NVFP4 on SM121 — I only tested RedHatAI/Qwen3.5-122B-A10B-NVFP4, which gave 16.6 tok/s (42% slower than INT4 baseline). Haven’t tried Sehyo/Qwen3.5-122B-A10B-NVFP4 specifically, but the bottleneck is the same — SM121 doesn’t have native FP4 CUTLASS kernels in vLLM yet, so any NVFP4 model falls back to Marlin SM80 which handles FP4 poorly. Waiting on vLLM PRs #38957 and #31607. Once those land, NVFP4 might become competitive — but until then, INT4 AutoRound is the way to go on Spark.
So I'm testing on dual Spark with MTP-2 and MTP-1, and I'm getting regressions.
Config │ t/s │ Acceptance
no MTP │ 44.5 │ N/A
MTP-1 │ ~27 │ 82.8%
MTP-2 │ ~28 │ 68.6%
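For context, a standard back-of-the-envelope model: with per-token acceptance rate a and draft length k, speculative decoding emits (1 - a^(k+1)) / (1 - a) tokens per verification step on average. Plugging in the acceptance rates from the table:

```shell
# Expected tokens per verification step; this model ignores draft overhead
# and any cudagraph-mode penalty, which is exactly where a regression
# despite high acceptance would have to come from.
awk 'BEGIN {
  a = 0.828; k = 1; printf "MTP-1: %.2f tokens/step\n", (1 - a^(k+1)) / (1 - a)
  a = 0.686; k = 2; printf "MTP-2: %.2f tokens/step\n", (1 - a^(k+1)) / (1 - a)
}'
# prints: MTP-1: 1.83 tokens/step
#         MTP-2: 2.16 tokens/step
```

Both configs should beat the baseline by this estimate, so the observed slowdown points to per-step overhead rather than poor acceptance.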
Claude is saying:
The PIECEWISE cudagraph mode (forced by FlashInfer + spec-decode) could be the culprit.
As @AoE mentioned, MTP doesn't work on dual Spark. I don't have a second Spark to investigate the reason right now.