50%+ Improvement on spark?!

Just saw this on r/LocalLLaMA.

OP claims:

TL;DR: Built a custom CUTLASS kernel to fix SM120’s broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.

The Problem

If you’re running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark — basically any SM120 Blackwell workstation GPU — you’ve probably seen this:

Failed to initialize cutlass TMA WS grouped gemm

The autotuner skips all the SM120 GEMM tiles because they overflow your GPU’s 99KB shared memory. Datacenter Blackwell (B200) has 228KB SMEM, so the tiles were designed for that. Your workstation GPU gets stuck on slow fallback kernels.

Result: You’re leaving 50%+ of your throughput on the table.

The fix is two files:

  1. CUTLASS builder (sm120_blockscaled_mma_builder.inl) — the actual kernel fix

  2. Codegen (generate_kernels.py) — enables K=64 tile generation for SM120

How can we adapt this to test it out? If the claims are true, this is a huge deal for Spark owners. Paging @eugr :)

Link to post:

Yes, I saw it too. The only concern is that the driver and CUDA versions are restricted for spark.

I’m trying to apply it to eugr’s spark vllm… wish me luck !

Please keep us updated! GL

building now. had to extract cutlass changes from his docker image

I have it running. Sample queries are going OK. Benchmarks inbound.

I’m not sure what they’re claiming, but it is worth noting that specdec does not work for MoE models in batch size 1. This problem is inherent to the design of the models and to the number of tokens you can realistically speculate about with a normal draft model/draft head/MTP.

If someone was seeing a speedup for MoE specdec, then they were presumably batching large numbers of parallel requests.

I’m getting ~16 tokens/sec. Not great. I am not convinced I have it dialed in though; still testing and playing with enabling MTP (16 t/s is with MTP disabled, so it doesn’t really seem better than the other NVFP4 hacks out there).

With MTP=2 (this seems optimal) I am getting about 32 tokens/sec sustained. It’s a big improvement. Going to make it my daily driver for a few days. llama-benchy does NOT reflect this increase in performance; the actual numbers I get with it are abysmal. I’m probably doing something wrong.

Thank you for testing this. This looks really promising!

I tried monkey patching things but didn’t manage to get things up and running. Can you share your approach?

If llama-benchy isn’t showing the performance improvement, then it is unlikely to be real.

I’ll share a draft PR against eugr’s spark vllm repo when I’m able to get away from the kids. I’m new to Spark, so no promises that this is fully correct… but it does run, and the perf, at least against real-time queries, seems a shade better than autoround.

That would be great. Thanks!

The flashinfer pr to use should be embedded in a comment

There’s a fair amount of bellyaching about this on the Reddit thread. People are calling it useless AI slop (paraphrasing). Not sure I completely agree. Btw, the patches I manually applied are things that I forgot about (facepalm). One is a change to sm120_blockscaled_mma_builder.inl, which he posted, and the other was a change that was upstreamed into CUTLASS 4.4.1. Going to upgrade it, build it again, rebase against eugr’s nightly vLLM, and see what happens. Not expecting a lot.

# Qwen3.5-122B NVFP4 on a single DGX Spark — 19.6 tok/s with MTP=3

Tested Qwen3.5-122B-A10B-NVFP4 (Sehyo’s quant) on a single DGX Spark (GB10, SM 12.1, 128GB). Sharing results and setup in case it helps other Spark owners.

## Results

| Test | Tokens | Time | Throughput |
|------|--------|------|------------|
| Latency (short prompt) | 3 | 458 ms | n/a |
| Throughput (200 tokens) | 200 | 10.2 s | **19.6 tok/s** |
| Sustained (500 tokens) | 500 | 25.5 s | **19.6 tok/s** |

Thinking OFF, MTP=3, single user. Zero crashes across all tests.
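As a quick sanity check on the table, the throughput column is just generated tokens divided by wall-clock time. A minimal sketch (the function name is only for illustration):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Rows copied from the results table above: (tokens, seconds).
runs = {
    "Throughput (200 tokens)": (200, 10.2),
    "Sustained (500 tokens)": (500, 25.5),
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens_per_second(tokens, seconds):.1f} tok/s")
```

Both rows come out to 19.6 tok/s, matching the table.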

## Setup

- **Base**: spark-vllm-docker PR #98, "Add FlashInfer PR patching and K=64 SM120 CUTLASS fix" by RobTand (eugr/spark-vllm-docker on GitHub) (FlashInfer 0.6.6 + K=64 CUTLASS patch)

- **vLLM**: 0.17.1rc1.dev170

- **Driver**: 580.126.09, CUDA 13.1

## What I changed

### 1. `compute_121a` → `compute_121f`

The Dockerfile and `build-and-copy.sh` default to `12.1a` (architecture-specific feature set). Changed to `12.1f` (family-specific feature set) in both files. This requires CUDA 13.0+, which the Spark has. The `fused_moe_120.so` in the FlashInfer JIT cache wheel is then compiled with `-arch sm_121f` instead of `sm_121a`.

When rebuilding, you must first delete the old FlashInfer wheels (`rm wheels/flashinfer*.whl`) and only then run `build-and-copy.sh --rebuild-flashinfer`, because the script skips the rebuild if wheels already exist (even with the flag).
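If you want the wheel cleanup to be scriptable, here is a minimal Python sketch of the same `rm wheels/flashinfer*.whl` step. The helper name is hypothetical; only the `wheels` directory and the glob pattern come from the post.

```python
from pathlib import Path


def remove_stale_wheels(wheel_dir: str, pattern: str = "flashinfer*.whl") -> list[str]:
    """Delete wheels matching `pattern` in `wheel_dir`; return the removed file names."""
    removed = []
    for wheel in sorted(Path(wheel_dir).glob(pattern)):
        wheel.unlink()
        removed.append(wheel.name)
    return removed


if __name__ == "__main__":
    # Run from the repo root, then start build-and-copy.sh --rebuild-flashinfer.
    print(remove_stale_wheels("wheels"))
```

Printing the removed names makes it easy to confirm that the old wheels were actually there before the rebuild.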

### 2. MTP=3 (Multi-Token Prediction)

Added `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` to the serve command. Qwen3.5 natively supports MTP (resolved as `Qwen3_5MoeMTP`). This is the biggest performance lever on a single Spark. MTP=3 is a safe choice for 128GB; MTP=5 might work but memory gets tight.
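Forum software and chat clients tend to turn `--` and straight quotes into typographic characters, which breaks the flag when copy-pasted. One way around that is to generate the argument instead of pasting it. A small sketch using only the standard library:

```python
import json
import shlex

spec_config = {"method": "mtp", "num_speculative_tokens": 3}

# Compact separators reproduce the JSON string without extra spaces.
json_arg = json.dumps(spec_config, separators=(",", ":"))

# shlex.quote wraps the JSON in single quotes so the shell passes it through untouched.
flag = f"--speculative-config {shlex.quote(json_arg)}"
print(flag)
```

Changing `num_speculative_tokens` in the dict is then enough to try MTP=2 or MTP=5 without retyping the quoting.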

### 3. Memory-constrained config for 128GB

| Parameter | Original recipe | My config |
|-----------|----------------|-----------|
| `gpu_memory_utilization` | 0.85 | 0.80 |
| `max_model_len` | 262144 | 32768 |
| `max_num_batched_tokens` | 8192 | 4096 |
| `max_num_seqs` | 32 | 8 |

Model loads at ~76 GiB (including MTP heads). At 0.80, the KV cache gets 17 GiB — enough for 32K context and MTP=3. Peak memory during shard loading hits ~117GB, so 0.85 works but leaves very little headroom with MTP on top.
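The memory budget can be sketched as simple arithmetic. This assumes, loudly, that vLLM sees the full 128 GB of unified memory and that `gpu_memory_utilization` is applied to that total; the real visible amount on a Spark may be lower, so treat the output as a rough estimate only.

```python
GIB = 2**30


def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * 1e9 / GIB


def remaining_after_weights(total_gib: float, utilization: float, weights_gib: float) -> float:
    """GiB left inside the gpu_memory_utilization budget once the weights are loaded."""
    return total_gib * utilization - weights_gib


total = gb_to_gib(128)  # ASSUMPTION: the full 128 GB is visible to vLLM
for util in (0.80, 0.85):
    left = remaining_after_weights(total, util, weights_gib=76)
    print(f"util={util:.2f}: {left:.1f} GiB for KV cache, activations and other overhead")
```

Under these assumptions 0.80 leaves roughly 19 GiB after the weights, so the reported 17 GiB KV cache would imply about 2 GiB for everything else.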

## The autotuner TMA WS skips — they’re expected

```
[Autotuner]: Skipping tactic 14 … Failed to initialize cutlass TMA WS grouped gemm (M128_BS_group4)
[Autotuner]: Skipping tactic 15 … Failed to initialize cutlass TMA WS grouped gemm (M256_BS_group1)
```

These are K=128 tiles that need ~228KB SMEM. The B200 has 228KB of SMEM, while the Spark’s GB10 has 99KB, so they can’t physically run on our hardware. The K=64 patch (FlashInfer PR #2786) adds tiles that fit in 99KB, and the autotuner selects those instead. This is correct behavior, not a fallback bug.

Per the Reddit benchmarks from VOIPMonitor, the K=64 patch itself gives ~2-6% improvement. The real throughput gain comes from MTP.

## Launch command

```bash
./launch-cluster.sh -t vllm-node-tf5 --solo \
  --apply-mod mods/fix-qwen3.5-autoround \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -d exec vllm serve sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.80 \
  --port 8000 --host 0.0.0.0 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --chat-template unsloth.jinja \
  -tp 1 --max-num-seqs 8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

## What’s next

Would love to hear from anyone running 2+ Sparks — with 256GB you could restore the full recipe params (262K context, max_num_seqs=32, MTP=5) and TP=2 over RoCE. That should push both throughput and context length significantly.

Also curious if Driver 595 + CUDA 13.2 makes a difference on the Spark like it reportedly does on RTX PRO 6000.

Dude you rock. Want to update my PR? I’d like to pull all your changes and test again :)

Also, in my testing, I found that MTP=2 seemed to yield better results. After each speculative call, vLLM spits out the success ratio of each token. I was finding the third one was almost always (like 90% of the time) below 50%, which I guess is the threshold for whether it’s useful or not.
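To put numbers on the MTP=2 vs MTP=3 question, here is a small sketch. It assumes each per-position rate is the fraction of drafting steps in which that position's draft token was accepted, so the average tokens emitted per step is one (the target model's own token) plus the sum of the rates. The rates below are hypothetical and only mirror "third position below 50%".

```python
def mean_tokens_per_step(position_accept_rates: list[float]) -> float:
    """Expected tokens emitted per verification step.

    Every step yields one token from the target model, plus one token for each
    draft position that was accepted.
    """
    return 1.0 + sum(position_accept_rates)


# HYPOTHETICAL acceptance rates for draft positions 1, 2 and 3.
rates = [0.80, 0.60, 0.40]
for k in range(1, len(rates) + 1):
    print(f"MTP={k}: {mean_tokens_per_step(rates[:k]):.2f} tokens/step")
```

Whether the extra fraction of a token from position 3 pays off depends on how much more expensive the verification step becomes, which this sketch does not model.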

If vLLM is counting each draft token and real token as tokens, it will spit out 2x speed, but the speed is actually the same. As I mentioned before, specdec (drafting) inherently does not work for batch size 1 MoEs. It is a limitation of the physics, not something that can be fixed with better code. Each additional token that is generated during specdec just has to be verified by the model, but verification means streaming the weights of the additional experts, and batch size 1 decode is bandwidth limited. There is no speedup because you just multiply the bandwidth needs.

Specdec can benefit production use cases for MoE models where you are batching large numbers of requests in parallel and you have a lot of extra compute capability on hand.
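The bandwidth argument above can be sketched as a toy cost model. Everything here is an assumption for illustration: routing is treated as uniform and independent per token, decode is treated as purely bandwidth-bound, and the expert count, top-k and dense fraction are placeholder values, not the real Qwen3.5 configuration.

```python
def expected_distinct_experts(n_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected number of distinct experts hit per layer when n_tokens each pick top_k experts."""
    miss = (1.0 - top_k / num_experts) ** n_tokens
    return num_experts * (1.0 - miss)


def relative_step_cost(n_tokens: int, num_experts: int, top_k: int, dense_fraction: float) -> float:
    """Weights streamed when verifying n_tokens, relative to a single-token decode step.

    dense_fraction is the share of single-token weight traffic that is read no
    matter which experts are chosen (attention, shared layers).
    """
    experts = expected_distinct_experts(n_tokens, num_experts, top_k)
    return dense_fraction + (1.0 - dense_fraction) * experts / top_k


def estimated_speedup(tokens_per_step: float, step_cost: float) -> float:
    """Bandwidth-bound speedup estimate: tokens gained divided by extra weight traffic."""
    return tokens_per_step / step_cost


# PLACEHOLDER values: 128 experts, top-8 routing, 20% dense traffic, 2 drafts + 1 real token.
cost = relative_step_cost(n_tokens=3, num_experts=128, top_k=8, dense_fraction=0.2)
print(f"relative step cost: {cost:.2f}x, speedup at 2.0 tokens/step: {estimated_speedup(2.0, cost):.2f}x")
```

With `dense_fraction=0` and no expert overlap, the step cost approaches `n_tokens` times a normal step, which is the "you just multiply the bandwidth needs" case. The other parameters show how sensitive the conclusion is to expert overlap and shared weights, so plug in measured values before trusting either direction.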

Please use llama-benchy to verify any performance improvements.
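For a quick cross-check that does not rely on the server's own logged numbers, you can time a request from the client side and divide the `completion_tokens` reported in the response by wall-clock time. This is a rough sketch, not a replacement for llama-benchy: it assumes the OpenAI-compatible `/v1/completions` endpoint of the vLLM server from the launch command is reachable on port 8000, and the measured time includes prefill.

```python
import json
import time
import urllib.request


def decode_rate(completion_tokens: int, elapsed_s: float) -> float:
    """Tokens actually returned to the client per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> float:
    """Send one completion request and return client-side tokens per second."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return decode_rate(payload["usage"]["completion_tokens"], elapsed)


# Example call (needs a running server, so it is left commented out):
# print(measure("http://localhost:8000", "sehyo/Qwen3.5-122B-A10B-NVFP4", "Write a haiku about GPUs."))
```

Because only tokens that reach the client are counted, draft tokens cannot inflate the result.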