Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark)

DFlash now has a PR open against vLLM: "[Spec Decode] Add Sliding Window Attention support to DFlash drafter" by jianc99 (vllm-project/vllm, Pull Request #40898). To try it:

./build-and-copy.sh -t vllm-node --apply-vllm-pr 40898 -c

Results with qwen3.5-122b-fp8, tp=2, dflash=15:

── Run 1/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 5.27s = 48.5 tok/s (prompt: 23)
  [Code] 512 tokens in 8.69s = 58.9 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.26s = 48.1 tok/s (prompt: 48)
  [Math] 64 tokens in 1.29s = 49.6 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 29.94s = 68.4 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
  [Q&A] 256 tokens in 4.94s = 51.8 tok/s (prompt: 23)
  [Code] 512 tokens in 8.39s = 61.0 tok/s (prompt: 30)
  [JSON] 1024 tokens in 20.80s = 49.2 tok/s (prompt: 48)
  [Math] 64 tokens in 1.27s = 50.3 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 29.84s = 68.6 tok/s (prompt: 37)
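
If you want one number per category instead of eyeballing two runs, something like the sketch below would do it. This is not the benchmark script that produced the output above; it is a small helper that parses lines in that format and reports total tokens divided by total seconds for each category. The names `LOG`, `LINE` and `summarize` are made up for illustration, and the regex assumes the exact line layout shown in the runs above.

```python
import re
from collections import defaultdict

# Benchmark output pasted from the two runs above.
LOG = """\
── Run 1/2 ──
  [Q&A] 256 tokens in 5.27s = 48.5 tok/s (prompt: 23)
  [Code] 512 tokens in 8.69s = 58.9 tok/s (prompt: 30)
  [JSON] 1024 tokens in 21.26s = 48.1 tok/s (prompt: 48)
  [Math] 64 tokens in 1.29s = 49.6 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 29.94s = 68.4 tok/s (prompt: 37)
── Run 2/2 ──
  [Q&A] 256 tokens in 4.94s = 51.8 tok/s (prompt: 23)
  [Code] 512 tokens in 8.39s = 61.0 tok/s (prompt: 30)
  [JSON] 1024 tokens in 20.80s = 49.2 tok/s (prompt: 48)
  [Math] 64 tokens in 1.27s = 50.3 tok/s (prompt: 29)
  [LongCode] 2048 tokens in 29.84s = 68.6 tok/s (prompt: 37)
"""

# Assumed line layout: "[Name] <tokens> tokens in <seconds>s ..."
LINE = re.compile(r"\[(?P<name>[^\]]+)\]\s+(?P<tokens>\d+) tokens in (?P<secs>[\d.]+)s")


def summarize(log: str) -> dict[str, float]:
    """Return tok/s per category, pooled over every run found in the log."""
    totals = defaultdict(lambda: [0, 0.0])  # name -> [tokens, seconds]
    for m in LINE.finditer(log):
        entry = totals[m["name"]]
        entry[0] += int(m["tokens"])
        entry[1] += float(m["secs"])
    return {name: tokens / secs for name, (tokens, secs) in totals.items()}


rates = summarize(LOG)
for name, rate in rates.items():
    print(f"{name:<9} {rate:5.1f} tok/s")
```

Pooling tokens and seconds weights each run by how long it actually took, which is slightly different from averaging the two printed tok/s figures. With only two runs the gap is small either way.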