# Qwen3.5-122B NVFP4 on a single DGX Spark — 19.6 tok/s with MTP=3
Tested Qwen3.5-122B-A10B-NVFP4 (Sehyo’s quant) on a single DGX Spark (GB10, SM 12.1, 128GB). Sharing results and setup in case it helps other Spark owners.
## Results
| Test | Tokens | Time | Throughput |
|------|--------|------|------------|
| Latency (short prompt) | 3 | 458ms | — |
| Throughput (200 tokens) | 200 | 10.2s | **19.6 tok/s** |
| Sustained (500 tokens) | 500 | 25.5s | **19.6 tok/s** |
Thinking OFF, MTP=3, single user. Zero crashes across all tests.
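
For anyone who wants to reproduce the throughput column, here is a minimal sketch. `tok_per_sec` just divides tokens by wall-clock seconds, which is how the table rows work out. `bench_once` is a hypothetical helper, not the script used for the numbers above: it assumes the server from the launch command is listening on `localhost:8000`, that `jq` and GNU `date` are installed, and it times the whole request (prefill included).

```bash
# Tokens per second from a token count and elapsed seconds.
tok_per_sec() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", n / t }'
}

# HYPOTHETICAL helper: time one completion request end to end.
# Assumes vLLM's OpenAI-compatible /v1/completions endpoint on port 8000.
bench_once() {
  local max_tokens="$1" start end tokens
  start=$(date +%s.%N)
  tokens=$(curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"sehyo/Qwen3.5-122B-A10B-NVFP4\",\"prompt\":\"Write a short story.\",\"max_tokens\":${max_tokens}}" \
    | jq '.usage.completion_tokens')
  end=$(date +%s.%N)
  tok_per_sec "$tokens" "$(awk -v s="$start" -v e="$end" 'BEGIN { print e - s }')"
}

# Recompute the two table rows from their token counts and times.
tok_per_sec 200 10.2
tok_per_sec 500 25.5
```

With the server up, `bench_once 200` would print a single tok/s figure for a 200-token generation.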
## Setup
- **Base**: spark-vllm-docker PR #98 ("Add FlashInfer PR patching and K=64 SM120 CUTLASS fix" by RobTand, eugr/spark-vllm-docker on GitHub), which provides FlashInfer 0.6.6 + the K=64 CUTLASS patch
- **vLLM**: 0.17.1rc1.dev170
- **Driver**: 580.126.09, CUDA 13.1
## What I changed
### 1. `compute_121a` → `compute_121f`
The Dockerfile and `build-and-copy.sh` default to `12.1a` (architecture-specific accelerated features). I changed both files to `12.1f` (family-specific features). This requires CUDA 13.0+, which the Spark has. The `fused_moe_120.so` in the FlashInfer JIT cache wheel is then compiled with `-arch sm_121f` instead of `sm_121a`.
Before rebuilding, delete the old FlashInfer wheels (`rm wheels/flashinfer*.whl`) and only then run `build-and-copy.sh --rebuild-flashinfer`. The script skips the rebuild if wheels already exist, even with the flag.
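
The two steps above can be sketched as a pair of shell functions. This is a sketch, not something taken from the repo: it assumes the arch string appears literally as `12.1a` in both files (check with `grep` first), that GNU `sed` is available, and that you run it from the repo root. The function names are made up for illustration.

```bash
# Rewrite the arch suffix in the given files, in place (GNU sed).
# ASSUMPTION: the target appears literally as "12.1a" in those files.
switch_arch_to_f() {
  sed -i 's/12\.1a/12.1f/g' "$@"
}

# Drop stale FlashInfer wheels, then rebuild. Without the rm, the script
# skips the rebuild even when --rebuild-flashinfer is passed.
rebuild_flashinfer() {
  rm -f wheels/flashinfer*.whl
  ./build-and-copy.sh --rebuild-flashinfer
}

# Usage, from the repo root:
#   grep -n '12\.1a' Dockerfile build-and-copy.sh
#   switch_arch_to_f Dockerfile build-and-copy.sh
#   rebuild_flashinfer
```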
### 2. MTP=3 (Multi-Token Prediction)
Added `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` to the serve command. Qwen3.5 natively supports MTP (resolved as `Qwen3_5MoeMTP`). This is the biggest performance lever on a single Spark. MTP=3 is a safe choice for 128GB; MTP=5 might work, but memory gets tight.
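
The flag takes a JSON string, so it has to use plain ASCII quotes. Forum software tends to turn these into curly quotes on copy-paste, and the flag then fails to parse. One way to avoid that and make the MTP depth easy to change is to build the string with `printf`. The variable names here are only illustrative.

```bash
MTP=3   # number of speculative tokens to try

# Build the JSON with printf so the quotes are guaranteed to be plain ASCII.
SPEC_CONFIG=$(printf '{"method":"mtp","num_speculative_tokens":%d}' "$MTP")
echo "$SPEC_CONFIG"

# Then pass it to the serve command as:
#   --speculative-config "$SPEC_CONFIG"
```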
### 3. Memory-constrained config for 128GB
| Parameter | Original recipe | My config |
|-----------|----------------|-----------|
| `gpu_memory_utilization` | 0.85 | 0.80 |
| `max_model_len` | 262144 | 32768 |
| `max_num_batched_tokens` | 8192 | 4096 |
| `max_num_seqs` | 32 | 8 |
Model loads at ~76 GiB (including MTP heads). At 0.80, the KV cache gets 17 GiB — enough for 32K context and MTP=3. Peak memory during shard loading hits ~117GB, so 0.85 works but leaves very little headroom with MTP on top.
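
As a rough sanity check on those numbers, the sketch below computes an upper bound on KV cache space as `total * gpu_memory_utilization - weights`. The total is an assumption: it treats the Spark's 128 GB as about 119.2 GiB fully visible to vLLM, which may not match what the runtime actually reports. The result comes out above the 17 GiB seen in practice, presumably because activations and other runtime overhead also come out of the same budget.

```bash
# Upper bound on KV-cache headroom in GiB: util * total - weights.
kv_headroom_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

TOTAL_GIB=119.2   # ASSUMPTION: 128 GB expressed in GiB (128e9 / 2^30)
WEIGHTS_GIB=76    # from the post: model weights including MTP heads

kv_headroom_gib "$TOTAL_GIB" 0.80 "$WEIGHTS_GIB"   # my config
kv_headroom_gib "$TOTAL_GIB" 0.85 "$WEIGHTS_GIB"   # original recipe
```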
## The autotuner TMA WS skips — they’re expected
```
[Autotuner]: Skipping tactic 14 … Failed to initialize cutlass TMA WS grouped gemm (M128_BS_group4)
[Autotuner]: Skipping tactic 15 … Failed to initialize cutlass TMA WS grouped gemm (M256_BS_group1)
```
These are K=128 tiles that need ~228KB of shared memory (SMEM). The B200 has 228KB of SMEM; the Spark’s GB10 has 99KB, so these tiles physically cannot run on our hardware. The K=64 patch (FlashInfer PR #2786) adds tiles that fit in 99KB, and the autotuner selects those instead. This is correct behavior, not a fallback bug.
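
To tell these expected skips apart from anything else the autotuner rejects, a quick `grep` over a saved log is enough. This is a sketch under the assumption that you have captured the server output to a file; the log path and function names are hypothetical, and the match strings are taken from the two lines quoted above.

```bash
# Count the expected TMA WS skips in a saved log file.
count_tma_skips() {
  grep -cF 'Failed to initialize cutlass TMA WS grouped gemm' "$1"
}

# Count autotuner skips with any other reason; these deserve a closer look.
count_other_skips() {
  grep -F 'Skipping tactic' "$1" | grep -cvF 'cutlass TMA WS grouped gemm'
}

# Usage (vllm.log is a hypothetical path to your captured output):
#   count_tma_skips vllm.log
#   count_other_skips vllm.log
```

A non-zero result from `count_other_skips` would mean tactics are being skipped for some reason other than the SMEM limit.
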
Per the Reddit benchmarks from VOIPMonitor, the K=64 patch itself gives ~2-6% improvement. The real throughput gain comes from MTP.
## Launch command
```bash
./launch-cluster.sh -t vllm-node-tf5 --solo \
--apply-mod mods/fix-qwen3.5-autoround \
--apply-mod mods/fix-qwen3.5-chat-template \
-d exec vllm serve sehyo/Qwen3.5-122B-A10B-NVFP4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.80 \
--port 8000 --host 0.0.0.0 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--chat-template unsloth.jinja \
-tp 1 --max-num-seqs 8 \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
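
Loading takes a while, so it helps to wait for the server before sending requests. Below is a minimal polling sketch, assuming the port from the command above and vLLM's `/health` endpoint. `wait_for_server` is a hypothetical helper, not part of the repo.

```bash
# Poll a URL once per second until it answers with success or we give up.
# Returns 0 when the server is up, 1 after roughly `timeout` seconds.
wait_for_server() {
  local url="$1" timeout="${2:-600}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Usage:
#   wait_for_server http://localhost:8000/health && \
#     curl -s http://localhost:8000/v1/models
```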
## What’s next
Would love to hear from anyone running 2+ Sparks — with 256GB you could restore the full recipe params (262K context, max_num_seqs=32, MTP=5) and TP=2 over RoCE. That should push both throughput and context length significantly.
Also curious if Driver 595 + CUDA 13.2 makes a difference on the Spark like it reportedly does on RTX PRO 6000.