MTP weight grafting — The graft-mtp.sh script downloads the MTP shards from the original BF16 checkpoint (stepfun-ai/Step-3.7-Flash), writes them as model-mtp.safetensors into the NVFP4 snapshot, registers them in the index, and
extends the truncated per-layer config lists. It runs on both nodes via SSH automatically.
vLLM MTP patch — Grafted MTP tensors are BF16, but the step3p5_mtp.py drafter inherits the model’s NVFP4 quant_config, which creates packed parameters of mismatched shapes. The launch script patches step3p5_mtp.py inside the
container (quant_config = None) before starting vLLM. It’s idempotent and doesn’t mutate the global config.
Standalone — No dependency on the spark-vllm-docker repo. One git clone and you’re off.
NCCL load-order fix — The launch templates symlink the system NCCL library so cross-node init works correctly with the official vLLM containers.
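For anyone curious what "registers them in the index" amounts to, here is a minimal sketch, not the actual contents of graft-mtp.sh. It assumes the snapshot uses the usual Hugging Face sharded layout, where model.safetensors.index.json holds a weight_map from tensor name to shard file. The function names and the tensor names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def register_shard(index, tensor_names, shard_file):
    """Return a copy of a safetensors index with tensor_names mapped to shard_file.

    `index` is the parsed model.safetensors.index.json. The input is not
    modified, and calling this twice gives the same result as calling it once.
    """
    updated = dict(index)
    weight_map = dict(index.get("weight_map", {}))
    for name in tensor_names:
        weight_map[name] = shard_file
    updated["weight_map"] = weight_map
    return updated


def update_index_file(index_path, tensor_names, shard_file):
    """Apply register_shard to an index file on disk (hypothetical helper)."""
    path = Path(index_path)
    index = json.loads(path.read_text())
    path.write_text(json.dumps(register_shard(index, tensor_names, shard_file), indent=2) + "\n")
```

Usage would look something like `update_index_file("model.safetensors.index.json", ["mtp.example.weight"], "model-mtp.safetensors")`, where the tensor name is a placeholder for whatever names the BF16 checkpoint actually uses.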
Build & launch (summary):
git clone https://github.com/MiaAI-Lab/Dual-DGX-Spark-Step-3.7-Flash-NVFP4.git
cd Dual-DGX-Spark-Step-3.7-Flash-NVFP4
cp config.env.example config.env
nano config.env # set your IPs, interfaces
./setup.sh
./build-image.sh
./copy-image-to-worker.sh
./download-model.sh
# Validate baseline first
./start.sh no-mtp
./test.sh
./stop.sh
# Graft MTP weights and run with speculation
./graft-mtp.sh
./start.sh mtp
./test.sh
Numbers I’m seeing:
Baseline (no MTP): ~21–22 tok/s at 262K ctx, 8 concurrent
MTP with 3 speculative tokens: ~31–32 tok/s decode (warm, single stream), aggregate ~33 tok/s at low concurrency
Context: 262K max model len, 8192 max batched tokens
MTP num_speculative_tokens must be divisible by 3. The stepfun37 image uses n_predict=3, so 4, 5, 7, etc. will fail. Use 3, 6, or 9. I found 3 is the sweet spot on GB10.
Always validate no-MTP first. If baseline doesn’t produce coherent text, MTP won’t either. Check the image version — the old vllm-node image produces BOS loops with this model.
The MTP patch is container-local. If you restart the container without ./start.sh mtp (which re-applies the patch), you’ll get RuntimeError: size of tensor a (2048) must match size of tensor b (4096).
Download the model on both nodes. ./download-model.sh handles this, but make sure both nodes have HF internet access or a shared cache.
NCCL link wedging. If you see mlx5: ACCESS_REG timeout during teardown/re-init, a cold reboot of the CX-7 was the only thing that cleared it for me.
No --spec-draft-p-min. The stepfun37 image doesn’t support it. Don’t add it unless you switch vLLM images.
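The divisibility rule above is easy to check before launching. This is a small sketch of such a pre-flight check, not something the repo ships. The function name is made up, and n_predict=3 is taken from the note about the stepfun37 image.

```python
def check_num_speculative_tokens(value, n_predict=3):
    """Return value if it is a positive multiple of n_predict, else raise ValueError."""
    if value <= 0 or value % n_predict != 0:
        raise ValueError(
            f"num_speculative_tokens={value} must be a positive multiple of n_predict={n_predict}"
        )
    return value


if __name__ == "__main__":
    # Print which candidate values would pass the check.
    for candidate in (3, 4, 5, 6, 7, 9):
        try:
            check_num_speculative_tokens(candidate)
            print(candidate, "ok")
        except ValueError as err:
            print(candidate, "rejected:", err)
```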
What I’d still like to improve:
Pushing long-context cold prefill down (it’s the current bottleneck)
Higher concurrency at 262K (KV budget is tight — 8 seqs is about the max at this context length)
If someone has a cleaner approach to the MTP grafting (without patching the vLLM source at runtime), I’d love to hear it.
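For comparison, the runtime patch boils down to an idempotent text edit. The sketch below shows the general shape only. The target line is a hypothetical stand-in (I am not quoting the real contents of step3p5_mtp.py), and the marker comment is an assumption about how a re-run can be detected.

```python
MARKER = "# mtp-bf16-patch"  # hypothetical marker used to detect an earlier run


def apply_patch(source, target, replacement):
    """Replace the first occurrence of target with replacement, once.

    Returns (new_source, changed). If the marker is already present the
    source is returned untouched, so re-running the patch is harmless.
    """
    if MARKER in source:
        return source, False
    if target not in source:
        raise ValueError("target line not found; the file layout may have changed")
    return source.replace(target, f"{replacement}  {MARKER}", 1), True


if __name__ == "__main__":
    # Hypothetical stand-in for the line that picks up the NVFP4 quant_config.
    original = "quant_config = vllm_config.quant_config\n"
    patched, changed = apply_patch(
        original, "quant_config = vllm_config.quant_config", "quant_config = None"
    )
    print(changed, patched, end="")
```

Only the drafter's local variable is rewritten here, which is the sense in which the patch leaves the global config alone.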
Hope this saves someone else the time I spent figuring out the MTP grafting puzzle. Questions, PRs, and benchmarks welcome!
I took a look at your config and tried a few things to see what may cause the MTP issue and how to solve it.
The key here is to use the official vLLM image (vllm/vllm-openai:stepfun37) over the latest vLLM release – I haven’t fully analyzed the differences between the images but it seems like that causes the lack of coherence and MTP issues you and I encountered. I did not have to patch MTP to make it work.
I played with two other MTP modes and found the performance to be identical:
JFYI, there is a merged fix for MTP in FlashInfer, but the vLLM side doesn’t include it yet. I will see if I can apply a quick patch, or whether there is already an open PR for that.
I’d be happy to test that to see if that can help bring me back to the main release. I’d prefer to benefit from the regular spark-vllm-docker infrastructure and build process if possible.
How is your experience with Step Fun, Mia? Is it better at coding than DS4F? I am looking for a model for a second option/audit function, 262k is okay as I would run it on summarized reports.
I think it’s about on par with DS4F, but with DS I get much more context and tok/s is better, so I usually stick with that unless I need image support.
Set it up today. BTW your repo is a bit dated; even the download script does not work with the new Hugging Face CLI. Also, the NCCL setup is incomplete. But a good starting point, thanks. I got it running with grafted MTP at 30 tps for 1 seq. Decent speed. But quality-wise it’s weak: 84 on hard mode vs 88-89 for DS4F, and in my own custom bencher designed for my specific tasks it’s on par with gemma 12b, pretty funny. But that’s because 12b is unbelievably good for a small model. Real gem. Unlike 26b. Sorta like 27b for qwen. But fast, unlike 27b (of course, large amount of actives). Net result: we are blessed by DeepSeek; there might be no better model for us for a long while, this good and performant.
Would you say DSV4F is better or worse than Stepfun for agentic use? Hermes specifically. DS4 is working very well, but I’m wondering about that sweet sweet vision for my agent’s main model.
You already answered my question. DS4F for the win, it seems!
Yes, gemma4-12b is a sleeper hit. I have not tried the vision capabilities either, but audio/ASR is insane with this model. It has become the auxiliary agent for DeepSeek, doing compacting, web search, checking tool calls, transcribing voice, etc. 265k native, quantizes cache very well, fits on 16GB with a llama.cpp instance, and it’s blazing fast.
Actually, re: vision – have you tried using it for OCR/documents/image understanding?
My current set up includes a strix halo running qwen35b and two eGPUs with gemma4-12b and qwen27b, as well as the dual sparks. I like slimming it all down to a single model, but for now, I will continue to use the skinny qwens for vision tasks in hermes :) DS4F is too good to not have it as the main agent model!
Don’t treat the tool eval bench as the ultimate measure of quality. Audit the log for this model and for ones like Gemma4 and you’ll find many reasonable responses where the model, when uncertain, opts to defer to the user for clarification. These get marked as failures, but they do not indicate agentic failures or a lack of tool-calling stability. That benchmark can be improved.
Step-3.7-Flash is one of the best models out there pound for pound when it comes to postgraduate level analysis and surprisingly great domain knowledge.
I can only run DS4-Flash via DwarfStar using the hybrid 2-bit quant today, so that may bias me. DS4 seems less reliable than Step-3.7-Flash, but that could be due to the latter fitting on a single GB10 with the IQ4_XS quant. Point being, this isn’t apples to apples.
I’m working on getting a 2nd GB10 and the full released DS4-Flash is one of the big reasons. Then both should be deployable via vLLM.