50%+ Improvement on spark?!

Dickson · March 14, 2026, 8:14pm

Just saw this on r/LocalLLaMA.

OP claims:

TL;DR: Built a custom CUTLASS kernel to fix SM120’s broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.

The Problem

If you’re running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark — basically any SM120 Blackwell workstation GPU — you’ve probably seen this:
Failed to initialize cutlass TMA WS grouped gemm
The autotuner skips all the SM120 GEMM tiles because they overflow your GPU’s 99KB shared memory. Datacenter Blackwell (B200) has 228KB SMEM, so the tiles were designed for that. Your workstation GPU gets stuck on slow fallback kernels.

Result: You’re leaving 50%+ of your throughput on the table.
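As a quick cross-check of the figures quoted in the TL;DR, the stepwise speedups can be worked out directly. This is only arithmetic on the numbers claimed in the Reddit post (not independently measured), and the helper name `stepwise_speedups` is made up for illustration.

```python
# Throughput figures quoted in the TL;DR above (tok/s); these are the
# Reddit poster's claims, not measurements made in this thread.
stages = [
    ("WSL2", 55),
    ("native Linux", 119),
    ("driver/config optimization", 142),
    ("custom K=64 kernel", 282),
]


def stepwise_speedups(stages):
    """Return (label, speedup over the previous stage) for each later stage."""
    return [
        (label, rate / prev_rate)
        for (_, prev_rate), (label, rate) in zip(stages, stages[1:])
    ]


for label, ratio in stepwise_speedups(stages):
    print(f"{label}: {ratio:.2f}x over previous stage")

# Share of the final claimed throughput that was missing before the kernel change.
missing_fraction = 1 - 142 / 282
print(f"missing before K=64 kernel: {missing_fraction:.1%}")
```

On these numbers the kernel step alone is a bit under 2x, which is where the "50%" headline comes from.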

The fix is two files:

CUTLASS builder (sm120_blockscaled_mma_builder.inl) — the actual kernel fix

Codegen (generate_kernels.py) — enables K=64 tile generation for SM120

How can we adapt this to test it out? If the claims are true, this is a huge deal for Spark owners. Paging @eugr :)

Link to post:

pontostroy · March 14, 2026, 8:26pm

Yes, I saw it too. The only concern is that the driver and CUDA versions are restricted on the Spark.

tenari · March 15, 2026, 12:03am

I’m trying to apply it to eugr’s spark vllm… wish me luck!

Dickson · March 15, 2026, 12:18am

Please keep us updated! GL

tenari · March 15, 2026, 12:35am

building now. had to extract cutlass changes from his docker image

tenari · March 15, 2026, 1:32am

I have it running. Sample queries are going OK. Benchmarks inbound.

coder543 · March 15, 2026, 2:19am

I’m not sure what they’re claiming, but it is worth noting that specdec does not work for MoE models in batch size 1. This problem is inherent to the design of the models and to the number of tokens you can realistically speculate about with a normal draft model/draft head/MTP.

If someone was seeing a speedup for MoE specdec, then they were presumably batching large numbers of parallel requests.

tenari · March 15, 2026, 4:00am

I’m getting ~16 token/sec. Not great. I am not convinced I have it dialed in though, still testing and playing with enabling mtp (16 t/s is with MTP disabled, so doesn’t really seem better than the other nvfp4 hacks out there)

tenari · March 15, 2026, 5:09am

With MTP=2 (this seems optimal) I am getting about 32 tokens/sec sustained. It’s a big improvement. Going to make it my daily driver for a few days. The ollama benchy does NOT reflect this increase in performance; the actual numbers I get with it are abysmal. I’m probably doing something wrong.

Dickson · March 15, 2026, 5:40am

Thank you for testing this. This looks really promising!

serapis · March 15, 2026, 10:01am

I tried monkey patching things but didn’t manage to get things up and running. Can you share your approach?

coder543 · March 15, 2026, 11:37am

If llama-benchy isn’t showing the performance improvement, then it is unlikely to be real.

tenari · March 15, 2026, 4:58pm

I’ll share a draft PR against eugr’s spark vllm repo when I’m able to get away from the kids. I’m new to Spark, so no promises that this is fully correct… but it does run, and the perf, at least against real-time queries, seems a shade better than autoround.

serapis · March 15, 2026, 5:06pm

That would be great. Thanks!

tenari · March 15, 2026, 5:07pm

github.com/eugr/spark-vllm-docker

Add FlashInfer PR patching support and K=64 SM120 CUTLASS fix

Commit: Add FlashInfer PR patching and K=64 SM120 CUTLASS fix

main ← RobTand:flashinfer-pr-patching

- Add --apply-flashinfer-pr flag to build-and-copy.sh for applying FlashInfer PRs at build time (mirrors existing --apply-vllm-pr)
- Include K=64 SM120 CUTLASS patch for workstation Blackwell GPUs (ref: flashinfer-ai/flashinfer#2786)
- Skip FlashInfer rebuild when wheels already exist
- Add Qwen3.5 recipes (122B-A10B, 397B-A17B NVFP4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The FlashInfer PR to use should be embedded in a comment.

tenari · March 15, 2026, 8:48pm

There’s a fair amount of bellyaching about this on the Reddit thread. People are calling it useless AI slop (paraphrasing). Not sure I completely agree. Btw, the patches I manually applied are things that I forgot about (facepalm). One is a change to sm120_blockscaled_mma_builder.inl, which he posted, and the other was a change that was upstreamed into CUTLASS 4.4.1. Going to upgrade it, build it again, rebase against eugr’s nightly vllm, and see what happens. Not expecting a lot.

sesmanovic · March 16, 2026, 9:21am

# Qwen3.5-122B NVFP4 on a single DGX Spark — 19.6 tok/s with MTP=3

Tested Qwen3.5-122B-A10B-NVFP4 (Sehyo’s quant) on a single DGX Spark (GB10, SM 12.1, 128GB). Sharing results and setup in case it helps other Spark owners.

## Results

| Test | Tokens | Time | Throughput |
|------|--------|------|------------|
| Latency (short prompt) | 3 | 458ms | n/a |
| Throughput (200 tokens) | 200 | 10.2s | **19.6 tok/s** |
| Sustained (500 tokens) | 500 | 25.5s | **19.6 tok/s** |

Thinking OFF, MTP=3, single user. Zero crashes across all tests.
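The throughput column follows directly from the token counts and wall-clock times in the Results rows. A minimal check, where the function name `tokens_per_second` is just an illustrative helper:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate over a whole run."""
    return tokens / seconds


# (tokens generated, wall-clock seconds) taken from the Results rows above.
runs = {
    "Throughput (200 tokens)": (200, 10.2),
    "Sustained (500 tokens)": (500, 25.5),
}

for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens_per_second(tokens, seconds):.1f} tok/s")
```

Both runs come out at the same rate, so the 500-token run shows no slowdown relative to the 200-token one.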

## Setup

- **Base**: spark-vllm-docker PR #98, "Add FlashInfer PR patching and K=64 SM120 CUTLASS fix" by RobTand (eugr/spark-vllm-docker), with FlashInfer 0.6.6 + K=64 CUTLASS patch

- **vLLM**: 0.17.1rc1.dev170

- **Driver**: 580.126.09, CUDA 13.1

## What I changed

### 1. `compute_121a` → `compute_121f`

The Dockerfile and `build-and-copy.sh` default to `12.1a` (architecture-specific feature set). Changed to `12.1f` (family-specific feature set) in both files. This requires CUDA 13.0+ which the Spark has. The `fused_moe_120.so` in the FlashInfer JIT cache wheel is then compiled with `-arch sm_121f` instead of `sm_121a`.

When rebuilding, you must first delete the old FlashInfer wheels (`rm wheels/flashinfer*.whl`) and only then run `build-and-copy.sh --rebuild-flashinfer`, because the script skips the rebuild if wheels already exist (even with the flag).

### 2. MTP=3 (Multi-Token Prediction)

Added `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` to the serve command. Qwen3.5 natively supports MTP (resolved as `Qwen3_5MoeMTP`). This is the biggest performance lever on a single Spark. MTP=3 is a safe choice for 128GB; MTP=5 might work but memory gets tight.
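Because the flag value is JSON inside shell quotes, it is easy to end up with typographic quotes or a wrong dash when copying it around. One way to sidestep that is to generate the argument programmatically; the sketch below uses only the Python standard library, and the variable names are illustrative.

```python
import json
import shlex

# The speculative decoding settings described above.
spec_config = {"method": "mtp", "num_speculative_tokens": 3}

# Compact separators give a JSON string with no spaces.
spec_json = json.dumps(spec_config, separators=(",", ":"))

# shlex.join adds the shell quoting needed to paste this into a command line.
args = ["--speculative-config", spec_json]
print(shlex.join(args))
```

Changing `num_speculative_tokens` in the dict is then enough to try MTP=2 or MTP=5 without hand-editing quotes.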

### 3. Memory-constrained config for 128GB

| Parameter | Original recipe | My config |
|-----------|----------------|-----------|
| `gpu_memory_utilization` | 0.85 | 0.80 |
| `max_model_len` | 262144 | 32768 |
| `max_num_batched_tokens` | 8192 | 4096 |
| `max_num_seqs` | 32 | 8 |

Model loads at ~76 GiB (including MTP heads). At 0.80, the KV cache gets 17 GiB — enough for 32K context and MTP=3. Peak memory during shard loading hits ~117GB, so 0.85 works but leaves very little headroom with MTP on top.
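A rough back-of-the-envelope version of this budget is sketched below. It assumes the full 128 GB of unified memory counts as the total that `gpu_memory_utilization` is applied to, and it treats everything other than weights as a single `overhead_gib` term; vLLM's real accounting profiles activations at startup, so this is only an approximation, and the function name is hypothetical.

```python
GIB = 2**30


def kv_cache_headroom_gib(total_bytes, gpu_memory_utilization, weights_gib, overhead_gib=0.0):
    """Memory left for KV cache after weights and other overhead, in GiB."""
    budget_gib = total_bytes * gpu_memory_utilization / GIB
    return budget_gib - weights_gib - overhead_gib


total_bytes = 128e9   # assumption: all 128 GB is visible to the engine
weights_gib = 76      # model size reported above, including MTP heads

for util in (0.80, 0.85):
    headroom = kv_cache_headroom_gib(total_bytes, util, weights_gib)
    print(f"utilization {util:.2f}: ~{headroom:.1f} GiB before activations/overhead")
```

At 0.80 this gives a little over 19 GiB before overhead, which is consistent with the reported 17 GiB KV cache if a couple of GiB go to activations and runtime buffers.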

## The autotuner TMA WS skips — they’re expected

```
[Autotuner]: Skipping tactic 14 … Failed to initialize cutlass TMA WS grouped gemm (M128_BS_group4)
[Autotuner]: Skipping tactic 15 … Failed to initialize cutlass TMA WS grouped gemm (M256_BS_group1)
```

These are K=128 tiles that need ~228KB SMEM. The B200 has 228KB SMEM, the Spark’s GB10 has 99KB. They can’t physically run on our hardware. The K=64 patch (FlashInfer PR #2786) adds tiles that fit in 99KB, and the autotuner selects those instead. This is correct behavior, not a fallback bug.

Per the Reddit benchmarks from VOIPMonitor, the K=64 patch itself gives ~2-6% improvement. The real throughput gain comes from MTP.

## Launch command

```bash
# Single-Spark launch: 32K context, MTP=3, memory-constrained settings
./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-qwen3.5-autoround \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -d exec vllm serve sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.80 \
  --port 8000 --host 0.0.0.0 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --chat-template unsloth.jinja \
  -tp 1 --max-num-seqs 8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
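For a quick sanity check against a server started this way, one option is to time a single request to the OpenAI-compatible `/v1/completions` endpoint and divide the completion tokens reported in the response by wall-clock time. This is a minimal sketch, not a replacement for a proper benchmark tool: the helper names are made up, the URL assumes the `--port 8000` setting above, and the rate includes prompt processing time.

```python
import json
import time
import urllib.request


def end_to_end_rate(completion_tokens: int, elapsed_s: float) -> float:
    """Completion tokens received per second of wall-clock time."""
    return completion_tokens / elapsed_s


def time_completion(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> float:
    """Send one non-streaming completion request and return its tok/s."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    # Use the server-reported count of generated tokens, not draft tokens.
    return end_to_end_rate(payload["usage"]["completion_tokens"], elapsed)


# Example call (requires the server to be running):
# time_completion("http://localhost:8000", "sehyo/Qwen3.5-122B-A10B-NVFP4", "Hello")
```

Counting only `completion_tokens` over wall-clock time avoids crediting speculative draft tokens that were never emitted.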

## What’s next

Would love to hear from anyone running 2+ Sparks — with 256GB you could restore the full recipe params (262K context, max_num_seqs=32, MTP=5) and TP=2 over RoCE. That should push both throughput and context length significantly.

Also curious if Driver 595 + CUDA 13.2 makes a difference on the Spark like it reportedly does on RTX PRO 6000.

tenari · March 16, 2026, 1:19pm

Dude you rock. Want to update my PR? I’d like to pull all your changes and test again :)

tenari · March 16, 2026, 1:21pm

Also, in my testing, I found that MTP=2 seemed to yield better results? After each speculative call, vLLM spits out the success ratio of each token. I was finding the third one was almost always (like 90% of the time) below 50%, which I guess is the threshold for whether it’s useful or not.
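One way to reason about whether a third draft token is worth it is to compute the expected number of tokens emitted per verification step. The sketch below assumes each rate is the probability that a draft position is accepted given that all earlier positions were accepted; if the logged values are already unconditional, the expectation is simply one plus their sum. The example rates are hypothetical, not taken from any log in this thread.

```python
def expected_tokens_per_step(conditional_acceptance_rates):
    """Expected tokens emitted per verification step.

    The target model always contributes one token; draft position i only
    counts if it and every earlier draft position were accepted.
    """
    expected = 1.0
    survive = 1.0
    for rate in conditional_acceptance_rates:
        survive *= rate
        expected += survive
    return expected


# Hypothetical acceptance rates for draft positions 1, 2 and 3.
rates = [0.8, 0.7, 0.5]
with_two = expected_tokens_per_step(rates[:2])
with_three = expected_tokens_per_step(rates)
print(f"MTP=2: {with_two:.2f} tokens/step, MTP=3: {with_three:.2f} tokens/step")
print(f"gain from the third draft token: {with_three - with_two:.2f}")
```

With these made-up rates the third position adds well under half a token per step, so it only pays off if the extra verification cost is small.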

coder543 · March 16, 2026, 1:25pm

If vLLM is counting each draft token and real token as tokens, it will spit out 2x speed, but the speed is actually the same. As I mentioned before, specdec (drafting) inherently does not work for batch size 1 MoEs. It is a limitation of the physics, not something that can be fixed with better code. Each additional token that is generated during specdec just has to be verified by the model, but verification means streaming the weights of the additional experts, and batch size 1 decode is bandwidth limited. There is no speedup because you just multiply the bandwidth needs.

Specdec can benefit production use cases for MoE models where you are batching large numbers of requests in parallel and you have a lot of extra compute capability on hand.
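The bandwidth argument can be illustrated with a toy model of how many distinct experts one MoE layer has to read when several tokens are verified together. It assumes each token picks its experts uniformly and independently, which real routers do not do, and the expert counts used (128 experts, 8 active per token) are hypothetical placeholders rather than the configuration of any model discussed here.

```python
def expected_distinct_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected distinct experts touched in one layer under uniform routing."""
    # Probability that a given expert is picked by none of the tokens.
    p_untouched = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - p_untouched)


num_experts, top_k = 128, 8  # hypothetical values for illustration only

for num_tokens in (1, 2, 3, 4):
    experts = expected_distinct_experts(num_experts, top_k, num_tokens)
    print(f"{num_tokens} token(s) verified: ~{experts:.1f} experts read, "
          f"{experts / top_k:.2f}x the single-token expert traffic")
```

Under these assumptions, expert weight traffic grows almost linearly with the number of tokens verified at batch size 1. Correlated routing, shared experts and attention weights would soften that in practice, so measured numbers remain the real test.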

Please use llama-benchy to verify any performance improvements.
