Step-3.7-Flash-NVFP4 with MTP on 2x DGX Spark — Standalone Recipe (262K ctx, ~32 tok/s)

This recipe runs stepfun-ai/Step-3.7-Flash-NVFP4 across 2x DGX Spark, including MTP speculative decoding with grafted BF16 weights.

The official NVFP4 ModelOpt export strips the MTP (next-n predict) layers, so you can’t just point vLLM at the checkpoint and get speculation.
There is a Step-3.7-Flash recipe with MTP in eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks), but it doesn’t work (see issue #284, “Step 3.7 Flash Recipe Broken”); I tried it myself.
I patched and fixed it, then put together a standalone repo that handles the grafting and serving end-to-end:

📦 https://github.com/MiaAI-Lab/Dual-DGX-Spark-Step-3.7-Flash-NVFP4

Setup:

  • 2x DGX Spark (GB10), 128GB unified each
  • Direct 200G QSFP56 link between them (RoCE, CX-7)
  • Official vllm/vllm-openai:stepfun37 base image
  • No-Ray, TP=2 with PyTorch-native distributed

What the repo does differently:

  1. MTP weight grafting — The graft-mtp.sh script downloads the MTP shards from the original BF16 checkpoint (stepfun-ai/Step-3.7-Flash), writes them as model-mtp.safetensors into the NVFP4 snapshot, registers them in the index, and
    extends the truncated per-layer config lists. It runs on both nodes via SSH automatically.
  2. vLLM MTP patch — Grafted MTP tensors are BF16, but the step3p5_mtp.py drafter inherits the model’s NVFP4 quant_config, which creates packed parameters of mismatched shapes. The launch script patches step3p5_mtp.py inside the
    container (quant_config = None) before starting vLLM. It’s idempotent and doesn’t mutate the global config.
  3. Standalone — No dependency on the spark-vllm-docker repo. One git clone and you’re off.
  4. NCCL load-order fix — The launch templates symlink the system NCCL library so cross-node init works correctly with the official vLLM containers.
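
The idempotent in-container patch described in item 2 can be sketched roughly as below. This is an illustration only, not the repo's actual script: the marker string and the `quant_config=quant_config` pattern are hypothetical (the real line in step3p5_mtp.py may look different), and `sed -i` is used in its GNU form. The demo runs against a throwaway file rather than the vLLM source.

```shell
#!/bin/sh
# Hypothetical marker used to detect an earlier run of the patch.
MARKER='# mtp-bf16-patch applied'

# Patch a file in place exactly once; a second call is a no-op.
patch_once() {
  file="$1"
  if grep -qF "$MARKER" "$file"; then
    return 0
  fi
  # Hypothetical pattern: stop passing the NVFP4 quant_config to the drafter.
  sed -i 's/quant_config=quant_config/quant_config=None/' "$file"
  printf '%s\n' "$MARKER" >> "$file"
}

# Demo on a temporary file standing in for step3p5_mtp.py.
demo="$(mktemp)"
printf 'layer = Block(quant_config=quant_config)\n' > "$demo"
patch_once "$demo"
patch_once "$demo"   # second call changes nothing
cat "$demo"
```

Checking for a marker before editing is what makes it safe to re-run on every container start.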

Build & launch (summary):

 git clone https://github.com/MiaAI-Lab/Dual-DGX-Spark-Step-3.7-Flash-NVFP4.git
 cd Dual-DGX-Spark-Step-3.7-Flash-NVFP4
 cp config.env.example config.env
 nano config.env               # set your IPs, interfaces
 ./setup.sh
 ./build-image.sh
 ./copy-image-to-worker.sh
 ./download-model.sh

 # Validate baseline first
 ./start.sh no-mtp
 ./test.sh
 ./stop.sh

 # Graft MTP weights and run with speculation
 ./graft-mtp.sh
 ./start.sh mtp
 ./test.sh

Numbers I’m seeing:

  • Baseline (no MTP): ~21–22 tok/s at 262K ctx, 8 concurrent
  • MTP with 3 speculative tokens: ~31–32 tok/s decode (warm, single stream), aggregate ~33 tok/s at low concurrency
  • Context: 262K max model len, 8192 max batched tokens
  • KV cache: FP8, GPU memory utilization 0.85

Key flags the recipe sets:

 --quantization modelopt
 --kv-cache-dtype fp8
 --disable-cascade-attn
 --disable-custom-all-reduce
 --no-enable-flashinfer-autotune
 --enable-auto-tool-choice
 --reasoning-parser step3p5
 --tool-call-parser step3p5
 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Gotchas that cost me time:

  • MTP num_speculative_tokens must be divisible by 3. The stepfun37 image uses n_predict=3, so 4, 5, 7, etc. will fail. Use 3, 6, or 9. I found 3 is the sweet spot on GB10.
  • Always validate no-MTP first. If baseline doesn’t produce coherent text, MTP won’t either. Check the image version — the old vllm-node image produces BOS loops with this model.
  • The MTP patch is container-local. If you restart the container without ./start.sh mtp (which re-applies the patch), you’ll get RuntimeError: size of tensor a (2048) must match size of tensor b (4096).
  • Download the model on both nodes. ./download-model.sh handles this, but make sure both nodes have HF internet access or a shared cache.
  • NCCL link wedging. If you see mlx5: ACCESS_REG timeout during teardown/re-init, a cold reboot of the CX-7 was the only thing that cleared it for me.
  • No --spec-draft-p-min. The stepfun37 image doesn’t support it. Don’t add it unless you switch vLLM images.
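
The first gotcha is easy to guard against before launching. Here is a minimal sketch of such a check, assuming n_predict=3 as stated above; the function name `check_spec_tokens` is made up for illustration and is not part of the repo.

```shell
#!/bin/sh
# Accept num_speculative_tokens only if it is a positive multiple of n_predict.
# n_predict defaults to 3, the value reported for the stepfun37 image.
check_spec_tokens() {
  n="$1"
  n_predict="${2:-3}"
  if [ "$n" -gt 0 ] && [ $((n % n_predict)) -eq 0 ]; then
    echo "ok: $n"
  else
    echo "invalid: $n is not a positive multiple of $n_predict" >&2
    return 1
  fi
}

check_spec_tokens 3
check_spec_tokens 4 || echo "rejected 4"
```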

What I’d still like to improve:

  • Pushing long-context cold prefill down (it’s the current bottleneck)
  • Higher concurrency at 262K (KV budget is tight — 8 seqs is about the max at this context length)
  • If someone has a cleaner approach to the MTP grafting (without patching the vLLM source at runtime), I’d love to hear it.

Hope this saves someone else the time I spent figuring out the MTP grafting puzzle. Questions, PRs, and benchmarks welcome!

You’re also welcome to follow me on X: Mia (@MiaAI_lab)

I took a look at your config and tried a few things to see what may cause the MTP issue and how to solve it.

The key here is to use the official vLLM image (vllm/vllm-openai:stepfun37) over the latest vLLM release – I haven’t fully analyzed the differences between the images but it seems like that causes the lack of coherence and MTP issues you and I encountered. I did not have to patch MTP to make it work.

I played with two other MTP modes and found the performance to be identical:

MTP via a dedicated draft model based on the BF16 release (https://huggingface.co/Hikari07jp/Step-3.7-Flash-MTP-draft)

| model                           |           test |             t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------------|---------------:|----------------:|-------------:|----------------:|----------------:|----------------:|
| stepfun-ai/Step-3.7-Flash-NVFP4 |         pp2048 | 2464.71 ± 19.64 |              |   837.25 ± 6.81 |   834.10 ± 6.81 |   837.25 ± 6.81 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |          tg128 |    29.51 ± 1.10 | 35.00 ± 0.82 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d4096 | 3311.48 ± 32.35 |              | 1860.71 ± 18.50 | 1857.56 ± 18.50 | 1860.71 ± 18.50 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d4096 |    28.24 ± 1.75 | 35.33 ± 2.87 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d8192 | 3433.19 ± 24.76 |              | 2988.00 ± 21.65 | 2984.84 ± 21.65 | 2988.00 ± 21.65 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d8192 |    30.00 ± 1.10 | 39.00 ± 0.82 |                 |                 |                 |

llama-benchy (0.3.8.dev4+gece1fa650)
date: 2026-06-12 18:28:16 | latency mode: api

MTP via the NVFP4 quant (https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4/commit/4275532ffd9a9496ff36b7a2dc4a9db1048da438)

| model                           |           test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------------|---------------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| stepfun-ai/Step-3.7-Flash-NVFP4 |         pp2048 | 2593.64 ± 120.78 |              |  797.40 ± 35.82 |  794.38 ± 35.82 |  797.40 ± 35.82 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |          tg128 |     29.17 ± 1.79 | 36.67 ± 0.47 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d4096 |  3294.38 ± 62.65 |              | 1871.12 ± 35.06 | 1868.09 ± 35.06 | 1871.12 ± 35.06 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d4096 |     23.68 ± 0.22 | 31.00 ± 0.00 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d8192 |  3405.38 ± 34.17 |              | 3012.09 ± 29.75 | 3009.07 ± 29.75 | 3012.09 ± 29.75 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d8192 |     27.96 ± 0.95 | 35.67 ± 0.47 |                 |                 |                 |

llama-benchy (0.3.8.dev4+gece1fa650)
date: 2026-06-12 18:41:06 | latency mode: api

Model settings for both:

tensor_parallel_size: 2
gpu_memory_utilization: 0.79
max_model_len: 262144
max_num_seqs: 4
max_num_batched_tokens: 16384
quantization: modelopt
kv_cache_dtype: fp8
tool_call_parser: step3p5
reasoning_parser: step3p5
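
For anyone launching vLLM directly, the settings above map onto `vllm serve` flags roughly as follows. This is a sketch under assumptions: it only prints the command instead of running it, and it leaves out the multi-node and speculative-decoding options, which are not listed in the settings above.

```shell
#!/bin/sh
# Settings from the list above, expressed as vLLM CLI flags.
MODEL="stepfun-ai/Step-3.7-Flash-NVFP4"
VLLM_ARGS="--tensor-parallel-size 2 \
--gpu-memory-utilization 0.79 \
--max-model-len 262144 \
--max-num-seqs 4 \
--max-num-batched-tokens 16384 \
--quantization modelopt \
--kv-cache-dtype fp8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5"

# Print the command for review; run it yourself once the cluster is set up.
echo "vllm serve $MODEL $VLLM_ARGS"
```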

Thanks again for patiently answering on Twitter and publishing this recipe!

Sure thing! It drove me nuts but it’s finally working. I’m satisfied.

JFYI, there is a merged fix for MTP in FlashInfer, but the vLLM side doesn’t include it yet. I will see if I can apply a quick patch or whether there is an open PR for it now.

I’d be happy to test that to see if that can help bring me back to the main release. I’d prefer to benefit from the regular spark-vllm-docker infrastructure and build process if possible.

On a single Spark with llama.cpp (built with CUDA toolkit 13.3 + 595 drivers), IQ4_XS version, no MTP:

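| model                                     |    test |           t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:------------------------------------------|--------:|--------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Step-3.7-flash-IQ4_XS-00001-of-00003.gguf | pp16384 | 743.78 ± 1.34 |              | 19874.46 ± 66.28 | 19872.05 ± 66.28 | 19874.46 ± 66.28 |
| Step-3.7-flash-IQ4_XS-00001-of-00003.gguf |   tg128 |  27.85 ± 0.16 | 29.33 ± 0.47 |                  |                  |                  |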

How is your experience with Step Fun, Mia? Is it better at coding than DS4F? I am looking for a model for a second option/audit function, 262k is okay as I would run it on summarized reports.

I think it’s about on par with DS4F, but with DS I get much more context and tok/s is better, so I usually stick with that unless I need image support.

Any opinion on Minimax 2.7? I’m asking for the same reason.

Haven’t tried it… man I need more dgx sparks

Set it up today. BTW, your repo is a bit dated: even the download script does not work with the new Hugging Face CLI, and the NCCL setup is incomplete. But it’s a good starting point, thanks. I got it running with grafted MTP at 30 tps for 1 seq. Decent speed. But quality-wise it’s weak: 84 on hard mode vs. 88-89 for ds4f, and in my own custom bencher designed for my specific tasks it’s on par with gemma 12b, which is pretty funny. But that’s because 12b is unbelievably good for a small model. A real gem, unlike 26b. Sorta like 27b for qwen. But fast, unlike 27b (of course, large amount of actives). Net result: we are blessed by Deepseek; there might be no better model for us for a long while, this good and performant.

Thanks, will look into the scripts. But yeah, DS4F is my go-to for a reason; it’s currently unbeatable on our Sparks.

Would you say DSV4F is better or worse than Stepfun for agentic use? Hermes specifically. DS4 is working very well, but I’m wondering about that sweet sweet vision for my agent’s main model.

You already answered my question. DS4F for the win, it seems!

Yes, gemma4-12b is a sleeper hit. I have not tried the vision capabilities either, but audio/ASR is insane with this model; it has become the auxiliary agent for deepseek, doing compacting, web search, checking tool calls, transcribing voice, etc. 265k native, quantizes cache very well, fits on 16GB with a llama.cpp instance, and it’s blazing fast.

Actually, re: vision – have you tried using it for OCR/documents/image understanding?

only if you really need vision, which is fair

My current setup includes a strix halo running qwen35b and two eGPUs with gemma4-12b and qwen27b, as well as the dual sparks. I’d like to slim it all down to a single model, but for now I will continue to use the skinny qwens for vision tasks in hermes :) DS4F is too good not to have as the main agent model!

Look past tool eval bench as the ultimate source of quality. Audit the log for this model and ones like Gemma4 and you’ll find many reasonable responses which in times of uncertainty may opt to defer to the user for clarification - these get marked failures, but do not indicate agentic failures or lack of tool calling stability. That benchmark can be improved.

Step-3.7-Flash is one of the best models out there pound for pound when it comes to postgraduate level analysis and surprisingly great domain knowledge.

Well, I ran my own domain knowledge test and, like I said, it was on gemma 12b level.

I see no clear wins for step-3.7-flash vs ds4f. Would be happy to see what you found out.

I can only run DS4-Flash via DwarfStar using the hybrid 2-bit quant today, so that may bias me. DS4 seems less reliable than Step-3.7-Flash, but that could be due to the latter fitting on a single GB10 with the IQ4_XS quant. Point being, this isn’t apples to apples.

I’m working on getting a 2nd GB10 and the full released DS4-Flash is one of the big reasons. Then both should be deployable via vLLM.