DGX Spark GB10 / vLLM 0.19.1: TurboQuant KV cache integration results on Qwen3.5 and Nemotron, including gather-free Triton decode and CUDA WPH decode

This post summarizes the current state of my TurboQuant work on DGX Spark GB10 (Blackwell, SM121) with a patched vLLM 0.19.1 stack.

Repository:
https://github.com/bjk110/spark_vllm_docker/tree/feat/turboquant

Environment

  • Hardware: DGX Spark GB10 (Blackwell, SM121) ×2

  • Memory: 121 GB unified

  • Base image: NGC PyTorch 26.03

  • vLLM: 0.19.1 (a7d79fa, source build)

  • FlashInfer: 0.6.7 (CUTLASS 4.4.2, SM121 source build)

  • PyTorch: 2.11.0a0

  • CUDA: 13.2

  • Transformers: 5.5.0

  • TurboQuant base: vLLM PR #38280

  • CUDA WPH extension: AOT build, SM121, BLOCK_D=128/256

Summary of changes

1. Initial TurboQuant integration

  • Ported PR #38280 onto vLLM 0.19.1

  • Resolved page-size mismatch with _next_pow2(slot_bytes) padding

  • Increased KV cache capacity from 155K to 413K tokens

2. Incremental decode

  • Added dirty-block-only decode

  • Added full decode fallback for CUDA graph capture path

  • Improved c=4 throughput from 31.2 to 36.3 t/s
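The shape of the capture-aware branching, sketched with hypothetical names: in eager mode only dirty blocks are decoded, but during CUDA graph capture data-dependent CPU work (`.tolist()`, `.unique()` on device tensors) would abort capture, so the path falls back to decoding everything with a fixed shape:

```python
def blocks_to_decode(num_blocks, dirty_block_ids, capturing):
    """Return the block ids the dequant step should touch.

    capturing=True models the CUDA-graph capture path, where CPU-side
    ops on device tensors are illegal, so every block is decoded with a
    capture-safe, data-independent shape.
    """
    if capturing:
        return list(range(num_blocks))        # full-decode fallback
    return sorted(set(dirty_block_ids))       # dirty-block-only decode
```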

3. Gather-free Triton decode

  • Removed cache[flat_bt] full memcpy/gather path

  • Triton kernel now reads paged cache directly

  • Added early exit for padding slots

  • Fixed odd-offset fp16 crash (norm_offset=93) using byte-level _safe_view_fp16

  • Improved c=4 throughput from 36.3 to 39.9 t/s

4. CUDA WPH decode

  • Implemented AOT-built CUDA warp-per-head decode for SM121/aarch64

  • Added gather-free direct paged-cache read

  • Added BLOCK_D=128/256 template dispatch

  • Moved norm read into the CUDA kernel

  • Added 4-warp CTA configuration (128 threads)

  • Current WPH path is slightly faster than Triton on Qwen3.5 at c=2 and c=4
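The BLOCK_D dispatch on the Python side reduces to picking the smallest compiled template that covers the model's head_dim; kernel names here are hypothetical (the AOT build ships one instantiation per BLOCK_D):

```python
# One AOT-compiled instantiation per supported BLOCK_D (names illustrative).
_WPH_KERNELS = {128: "wph_decode_block_d128", 256: "wph_decode_block_d256"}

def select_wph_kernel(head_dim: int) -> str:
    """Pick the smallest BLOCK_D template that covers head_dim:
    128 for Qwen3.5-style heads, 256 for Nemotron-style heads."""
    for block_d in sorted(_WPH_KERNELS):
        if head_dim <= block_d:
            return _WPH_KERNELS[block_d]
    raise ValueError(f"no WPH instantiation covers head_dim={head_dim}")
```

head_dim=512 falls through to the error, which lines up with the Gemma limitation noted further down.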

Main issues encountered

| Issue | Cause | Resolution |
|---|---|---|
| Page-size mismatch | TurboQuant slot size differs from the Mamba/GDN layers' | `_next_pow2()` padding |
| CUDA graph capture crash | Python-side CPU ops (`.tolist()`, `.unique()`) during capture | capture-aware branching |
| Garbage WPH output in serving | output tensor mismatch under the old gather path | gather-free direct-read conversion |
| fp16 crash at odd `norm_offset` | misaligned fp16 view at offset 93 | byte-level `_safe_view_fp16` |
| `BLOCK_D=128` mismatch on head_dim=256 models | WPH hardcoded for 128 | templated `BLOCK_D=128/256` |
| Low occupancy with 1-warp CTA | insufficient scheduler utilization | 4-warp CTA |
| AOT build namespace error | `at::cuda` no longer valid in this path | `c10::cuda` |
| Nemotron int32 `block_table` incompatibility | int32 block-table input | cast to int64 |
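The Nemotron fix in the last row is just a dtype widening before the kernel launch. A NumPy sketch of the idea (the real code casts the device-side block table):

```python
import numpy as np

def normalize_block_table(block_table: np.ndarray) -> np.ndarray:
    """Widen an int32 block table to int64.

    The WPH kernel computes page offsets with 64-bit indices; Nemotron
    hands in an int32 table, so cast it up before the launch.
    """
    if block_table.dtype == np.int32:
        return block_table.astype(np.int64)
    return block_table
```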

Results

Qwen3.5-122B-A10B-NVFP4 (TP1)

| Metric | bf16 KV | TQ Triton | TQ WPH v2 |
|---|---|---|---|
| tg32 c=1 | 17.0 t/s | 14.1 t/s | 14.0 t/s |
| tg32 c=2 | 33.3 t/s | 23.5 t/s | 24.5 t/s |
| tg32 c=4 | 55.2 t/s | 39.7 t/s | 40.6 t/s |
| KV cache | 155K | 405K | 405K |
| Korean QA sanity | | 12/12 pass | 12/12 pass |

Interpretation:

  • KV cache increased from 155K to 405K

  • Initial TurboQuant c=4 throughput was 31.2 t/s

  • Current WPH v2 c=4 throughput is 40.6 t/s

  • This is approximately a 30% recovery relative to the initial TurboQuant state, while keeping the larger KV cache budget

Nemotron-H 120B-A12B-NVFP4 (TP1)

| Metric | TQ Triton | TQ WPH v2 |
|---|---|---|
| tg32 c=1 | 15.1 t/s | 15.3 t/s |
| KV cache | 1,548K | 1,423K |
| Compatibility | OK | OK |

Gemma 4 31B-it

Current status: not supported in this branch.

Reason:

  • heterogeneous head_dim (256 + 512)

  • vLLM currently forces the TRITON_ATTN backend for this model

  • likely requires per-layer backend dispatch or BLOCK_D=512 support

Regression test

Added:

  • scripts/test_wph_v2.py

Current status:

  • 6/6 PASS

  • covers BLOCK_D=128/256, outlier on/off, multi-head, and stress cases

Current status

At this point:

  • TurboQuant integration is functional on DGX Spark GB10

  • gather-free Triton decode is stable

  • CUDA WPH direct-read decode is functional in serving

  • 4-warp CTA is currently the best-performing WPH configuration

  • On Qwen3.5, WPH is slightly faster than Triton at c=2 and c=4

  • The main practical gain remains the increased KV cache budget

Relevant commits

  • 7fe59cf — TurboQuant full implementation

  • bf4ca06 — branch-specific README / development history

  • f1e3239 — WPH int32 fix + Nemotron/Gemma4 presets

Remaining work

Main items not addressed yet:

  1. Better support for heterogeneous head_dim models

  2. Additional CTA/layout experiments beyond the current 4-warp configuration

  3. Possible cleanup for upstreaming the gather-free WPH path independently of the full branch

If useful, I can post a follow-up with implementation details or a phase-by-phase patch breakdown.


Outstanding! Can’t wait to test it; past bedtime here :) Thank you for sharing all your insights and the repo.