Step-3.7-Flash-NVFP4 with MTP on 2x DGX Spark — Standalone Recipe (262K ctx, ~32 tok/s)

MiaAI_Lab · June 12, 2026, 2:02pm

This is a recipe for running stepfun-ai/Step-3.7-Flash-NVFP4 across 2x DGX Spark, including MTP speculative decoding with grafted BF16 weights.

The official NVFP4 ModelOpt export strips the MTP (next-n predict) layers, so you can’t just point vLLM at the checkpoint and get speculation.
There is a recipe for step-3.7-flash with MTP in eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks), but it doesn’t work (see Issue #284, "Step 3.7 Flash Recipe Broken", in eugr/spark-vllm-docker). I tried it myself.
I patched it myself and fixed it, and I put together a standalone repo that handles the grafting and serving end-to-end:

📦 GitHub: MiaAI-Lab/Dual-DGX-Spark-Step-3.7-Flash-NVFP4

Setup:

2x DGX Spark (GB10), 128GB unified each
Direct 200G QSFP56 link between them (RoCE, CX-7)
Official vllm/vllm-openai:stepfun37 base image
No-Ray, TP=2 with PyTorch-native distributed

What the repo does differently:

MTP weight grafting — The graft-mtp.sh script downloads the MTP shards from the original BF16 checkpoint (stepfun-ai/Step-3.7-Flash), writes them as model-mtp.safetensors into the NVFP4 snapshot, registers them in the index, and
extends the truncated per-layer config lists. It runs on both nodes via SSH automatically.
vLLM MTP patch — Grafted MTP tensors are BF16, but the step3p5_mtp.py drafter inherits the model’s NVFP4 quant_config, which creates packed parameters of mismatched shapes. The launch script patches step3p5_mtp.py inside the
container (quant_config = None) before starting vLLM. It’s idempotent and doesn’t mutate the global config.
Standalone — No dependency on the spark-vllm-docker repo. One git clone and you’re off.
NCCL load-order fix — The launch templates symlink the system NCCL library so cross-node init works correctly with the official vLLM containers.
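
The "registers them in the index" step above can be sketched in a few lines. This is a minimal illustration, not the code in graft-mtp.sh: it assumes the standard safetensors index layout (a `weight_map` from tensor name to shard file), and the tensor name `mtp.example.weight` is a hypothetical placeholder, not a real Step-3.7-Flash tensor name.

```python
import copy
import json


def register_shard(index: dict, tensor_names: list[str], shard_file: str) -> dict:
    """Return a copy of a safetensors index with tensor_names mapped to shard_file."""
    updated = copy.deepcopy(index)
    weight_map = updated.setdefault("weight_map", {})
    for name in tensor_names:
        # Re-registering the same name is harmless, so the step can be re-run.
        weight_map[name] = shard_file
    return updated


# Hypothetical index contents and tensor name, for illustration only.
index = {
    "metadata": {},
    "weight_map": {"model.embed_tokens.weight": "model-00001-of-00002.safetensors"},
}
patched = register_shard(index, ["mtp.example.weight"], "model-mtp.safetensors")
print(json.dumps(patched["weight_map"], indent=2))
```

Returning a copy rather than editing in place makes it easy to diff the old and new index before writing anything back to the snapshot directory.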

Build & launch (summary):

 git clone https://github.com/MiaAI-Lab/Dual-DGX-Spark-Step-3.7-Flash-NVFP4.git
 cd Dual-DGX-Spark-Step-3.7-Flash-NVFP4
 cp config.env.example config.env
 nano config.env               # set your IPs, interfaces
 ./setup.sh
 ./build-image.sh
 ./copy-image-to-worker.sh
 ./download-model.sh

 # Validate baseline first
 ./start.sh no-mtp
 ./test.sh
 ./stop.sh

 # Graft MTP weights and run with speculation
 ./graft-mtp.sh
 ./start.sh mtp
 ./test.sh

Numbers I’m seeing:

Baseline (no MTP): ~21–22 tok/s at 262K ctx, 8 concurrent
MTP with 3 speculative tokens: ~31–32 tok/s decode (warm, single stream), aggregate ~33 tok/s at low concurrency
Context: 262K max model len, 8192 max batched tokens
KV cache: FP8, GPU memory utilization 0.85
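
For a rough sense of what MTP buys, the two decode ranges above can be turned into a speedup range. This is only arithmetic on the quoted figures. The baseline and MTP numbers may not have been measured under identical concurrency, so treat the result as indicative.

```python
def speedup_range(
    baseline: tuple[float, float], boosted: tuple[float, float]
) -> tuple[float, float]:
    """Return (worst-case, best-case) speedup for two (low, high) throughput ranges."""
    base_lo, base_hi = baseline
    new_lo, new_hi = boosted
    # Worst case: slowest boosted run vs fastest baseline run, and vice versa.
    return new_lo / base_hi, new_hi / base_lo


# Figures quoted above: ~21-22 tok/s without MTP, ~31-32 tok/s with MTP.
low, high = speedup_range((21.0, 22.0), (31.0, 32.0))
print(f"speedup between {low:.2f}x and {high:.2f}x")
```

That works out to roughly 1.4x to 1.5x decode throughput from three speculative tokens.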

Key flags the recipe sets:

 --quantization modelopt
 --kv-cache-dtype fp8
 --disable-cascade-attn
 --disable-custom-all-reduce
 --no-enable-flashinfer-autotune
 --enable-auto-tool-choice
 --reasoning-parser step3p5
 --tool-call-parser step3p5
 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
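
If you generate the launch command from a script, a small helper can build the `--speculative-config` argument and reject unsupported token counts early. This is a sketch, not part of the repo: the function name is made up, and the divisible-by-3 check encodes the n_predict=3 behaviour reported in this post for the stepfun37 image.

```python
import json
import shlex

N_PREDICT = 3  # reported in this post for the stepfun37 image; an assumption elsewhere


def speculative_config_arg(num_speculative_tokens: int, n_predict: int = N_PREDICT) -> str:
    """Build the --speculative-config flag, rejecting counts the image would refuse."""
    if num_speculative_tokens <= 0 or num_speculative_tokens % n_predict != 0:
        raise ValueError(
            f"num_speculative_tokens must be a positive multiple of {n_predict}, "
            f"got {num_speculative_tokens}"
        )
    payload = json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return f"--speculative-config {shlex.quote(payload)}"


flag = speculative_config_arg(3)
print(flag)
```

Failing in the helper is cheaper than waiting for the model to load on both nodes before vLLM rejects the value.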

Gotchas that cost me time:

MTP num_speculative_tokens must be divisible by 3. The stepfun37 image uses n_predict=3, so 4, 5, 7, etc. will fail. Use 3, 6, or 9. I found 3 is the sweet spot on GB10.
Always validate no-MTP first. If baseline doesn’t produce coherent text, MTP won’t either. Check the image version — the old vllm-node image produces BOS loops with this model.
The MTP patch is container-local. If you restart the container without ./start.sh mtp (which re-applies the patch), you’ll get RuntimeError: size of tensor a (2048) must match size of tensor b (4096).
Download the model on both nodes. ./download-model.sh handles this, but make sure both nodes have HF internet access or a shared cache.
NCCL link wedging. If you see mlx5: ACCESS_REG timeout during teardown/re-init, a cold reboot of the CX-7 was the only thing that cleared it for me.
No --spec-draft-p-min. The stepfun37 image doesn’t support it. Don’t add it unless you switch vLLM images.
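
Since the MTP patch has to be re-applied on every container start, it helps that it is idempotent. The sketch below shows one way to get that property with a marker comment. It is not the repo's launch script, and the source line being replaced is a hypothetical stand-in for whatever step3p5_mtp.py actually contains.

```python
PATCH_MARKER = "# patched: MTP drafter loads BF16 weights unquantized"


def patch_once(source: str, target: str, replacement: str, marker: str = PATCH_MARKER) -> str:
    """Replace target with replacement once; return source unchanged if already patched."""
    if marker in source:
        return source  # already patched, nothing to do
    if target not in source:
        raise ValueError("target line not found; the file layout may have changed")
    return source.replace(target, f"{replacement}  {marker}", 1)


# Hypothetical original line, for illustration only.
original = "quant_config = vllm_config.quant_config\n"
once = patch_once(original, "quant_config = vllm_config.quant_config", "quant_config = None")
twice = patch_once(once, "quant_config = vllm_config.quant_config", "quant_config = None")
print(once, end="")
```

Raising when the target is missing is deliberate: a silent no-op would surface later as the tensor size mismatch described above.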

What I’d still like to improve:

Pushing long-context cold prefill down (it’s the current bottleneck)
Higher concurrency at 262K (KV budget is tight — 8 seqs is about the max at this context length)
If someone has a cleaner approach to the MTP grafting (without patching the vLLM source at runtime), I’d love to hear it.

Hope this saves someone else the time I spent figuring out the MTP grafting puzzle. Questions, PRs, and benchmarks welcome!

You’re also welcome to follow me on X: Mia (@MiaAI_lab).

serapis · June 12, 2026, 4:52pm

I took a look at your config and tried a few things to see what may cause the MTP issue and how to solve it.

The key here is to use the official vLLM image (vllm/vllm-openai:stepfun37) over the latest vLLM release – I haven’t fully analyzed the differences between the images but it seems like that causes the lack of coherence and MTP issues you and I encountered. I did not have to patch MTP to make it work.

I played with two other MTP modes and found the performance to be identical:

MTP via a dedicated draft model based on the BF16 release (https://huggingface.co/Hikari07jp/Step-3.7-Flash-MTP-draft)

| model                           |           test |             t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------------|---------------:|----------------:|-------------:|----------------:|----------------:|----------------:|
| stepfun-ai/Step-3.7-Flash-NVFP4 |         pp2048 | 2464.71 ± 19.64 |              |   837.25 ± 6.81 |   834.10 ± 6.81 |   837.25 ± 6.81 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |          tg128 |    29.51 ± 1.10 | 35.00 ± 0.82 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d4096 | 3311.48 ± 32.35 |              | 1860.71 ± 18.50 | 1857.56 ± 18.50 | 1860.71 ± 18.50 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d4096 |    28.24 ± 1.75 | 35.33 ± 2.87 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d8192 | 3433.19 ± 24.76 |              | 2988.00 ± 21.65 | 2984.84 ± 21.65 | 2988.00 ± 21.65 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d8192 |    30.00 ± 1.10 | 39.00 ± 0.82 |                 |                 |                 |

llama-benchy (0.3.8.dev4+gece1fa650)
date: 2026-06-12 18:28:16 | latency mode: api

MTP via the NVFP4 quant (https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4/commit/4275532ffd9a9496ff36b7a2dc4a9db1048da438)

| model                           |           test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:--------------------------------|---------------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| stepfun-ai/Step-3.7-Flash-NVFP4 |         pp2048 | 2593.64 ± 120.78 |              |  797.40 ± 35.82 |  794.38 ± 35.82 |  797.40 ± 35.82 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |          tg128 |     29.17 ± 1.79 | 36.67 ± 0.47 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d4096 |  3294.38 ± 62.65 |              | 1871.12 ± 35.06 | 1868.09 ± 35.06 | 1871.12 ± 35.06 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d4096 |     23.68 ± 0.22 | 31.00 ± 0.00 |                 |                 |                 |
| stepfun-ai/Step-3.7-Flash-NVFP4 | pp2048 @ d8192 |  3405.38 ± 34.17 |              | 3012.09 ± 29.75 | 3009.07 ± 29.75 | 3012.09 ± 29.75 |
| stepfun-ai/Step-3.7-Flash-NVFP4 |  tg128 @ d8192 |     27.96 ± 0.95 | 35.67 ± 0.47 |                 |                 |                 |

llama-benchy (0.3.8.dev4+gece1fa650)
date: 2026-06-12 18:41:06 | latency mode: api

Model settings for both:

tensor_parallel_size: 2
gpu_memory_utilization: 0.79
max_model_len: 262144
max_num_seqs: 4
max_num_batched_tokens: 16384
quantization: modelopt
kv_cache_dtype: fp8
tool_call_parser: step3p5
reasoning_parser: step3p5

Thanks again for patiently answering on Twitter and publishing this recipe!

MiaAI_Lab · June 12, 2026, 4:55pm

Sure thing! It drove me nuts but it’s finally working. I’m satisfied.

eugr_nv · June 12, 2026, 5:13pm

JFYI, there is a merged fix for MTP in FlashInfer, but the vLLM side doesn’t include it yet. I will see if I can apply a quick patch, or whether there is an open PR for that now.

serapis · June 12, 2026, 5:20pm

I’d be happy to test that to see if that can help bring me back to the main release. I’d prefer to benefit from the regular spark-vllm-docker infrastructure and build process if possible.

vasimv · June 12, 2026, 11:06pm

On a single Spark with llama.cpp (built with CUDA toolkit 13.3 + 595 drivers), IQ4_XS version, no MTP:

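| model                                     |    test |           t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:------------------------------------------|--------:|--------------:|-------------:|-----------------:|-----------------:|-----------------:|
| Step-3.7-flash-IQ4_XS-00001-of-00003.gguf | pp16384 | 743.78 ± 1.34 |              | 19874.46 ± 66.28 | 19872.05 ± 66.28 | 19874.46 ± 66.28 |
| Step-3.7-flash-IQ4_XS-00001-of-00003.gguf |   tg128 |  27.85 ± 0.16 | 29.33 ± 0.47 |                  |                  |                  |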

0rand · June 16, 2026, 7:00pm

How is your experience with Step Fun, Mia? Is it better at coding than DS4F? I am looking for a model for a second option/audit function, 262k is okay as I would run it on summarized reports.

MiaAI_Lab · June 16, 2026, 7:23pm

I think it’s about on par with DS4F, but with DS I get much more context and tok/s is better, so I usually stick with that unless I need image support.

0rand · June 16, 2026, 7:24pm

Any opinion of Minimax 2.7? For the same reason

MiaAI_Lab · June 16, 2026, 7:37pm

Haven’t tried it… man I need more dgx sparks

0rand · June 17, 2026, 7:47pm

Set it up today. BTW your repo is a bit dated; even the download script does not work with the new Hugging Face CLI. Also the NCCL setup is incomplete. But a good starting point, thanks. I got it running with grafted MTP at 30 tps for 1 seq. Decent speed. But quality-wise it’s weak. 84 on hard mode, vs 88-89 for DS4F, and in my own custom bencher designed for my specific tasks it’s on par with gemma 12b, pretty funny. But that’s because 12b is unbelievably good for a small model. Real gem. Unlike 26b. Sorta like 27b for qwen. But fast, unlike 27b (of course, large amount of actives). Net result: we are blessed by Deepseek; there might be no better model for us for a long while, this good and performant.

MiaAI_Lab · June 17, 2026, 8:32pm

Thanks, will look into the scripts. But yeah, DS4F is my go-to for a reason, it’s unbeatable currently on our sparks.

jc2375 · June 18, 2026, 9:32pm

would you say DSV4F is better or worse than Stepfun for agentic use? Hermes specifically. DS4 is working very well, but I’m wondering about that sweet sweet vision for my agent’s main model

You already answered my question. DS4F for the win, it seems!

jc2375 · June 18, 2026, 9:36pm

Yes, gemma4-12b is a sleeper hit. I have not tried the vision capabilities as well, but audio/ASR is insane with this model. It has become the auxiliary agent for deepseek, doing compacting, web search, checking tool calls, transcribing voice, etc. 265k native, quantizes cache very well, fits on 16GB with a llama.cpp instance, and it’s blazing fast.

Actually, re: vision – have you tried using it for OCR/documents/image understanding?

MiaAI_Lab · June 18, 2026, 9:49pm

only if you really need vision, which is fair

jc2375 · June 18, 2026, 10:03pm

My current setup includes a strix halo running qwen35b and two eGPUs with gemma4-12b and qwen27b, as well as the dual sparks. I’d like to slim it all down to a single model, but for now, I will continue to use the skinny qwens for vision tasks in hermes :) DS4F is too good to not have it as the main agent model!

jwarner · June 18, 2026, 10:33pm

Look past tool eval bench as the ultimate source of quality. Audit the log for this model and ones like Gemma4 and you’ll find many reasonable responses which in times of uncertainty may opt to defer to the user for clarification - these get marked failures, but do not indicate agentic failures or lack of tool calling stability. That benchmark can be improved.

Step-3.7-Flash is one of the best models out there pound for pound when it comes to postgraduate level analysis and surprisingly great domain knowledge.

0rand · June 18, 2026, 10:36pm

Well, I ran my own domain knowledge test and, like I said, it was on gemma 12b level.

MiaAI_Lab · June 19, 2026, 6:13am

I see no clear wins for step-3.7-flash vs ds4f. Would be happy to see what you found out.

jwarner · June 20, 2026, 11:23pm

I can only run DS4-Flash via DwarfStar using the hybrid 2-bit quant today, so that may bias me. DS4 seems less reliable than Step-3.7-Flash, but that could be due to the latter fitting on a single GB10 with the IQ4_XS quant. Point being, this isn’t apples to apples.

I’m working on getting a 2nd GB10 and the full released DS4-Flash is one of the big reasons. Then both should be deployable via vLLM.
