This post summarizes the current state of my TurboQuant work on DGX Spark GB10 (Blackwell, SM121) with a patched vLLM 0.19.1 stack.
Repository:
https://github.com/bjk110/spark_vllm_docker/tree/feat/turboquant
Environment

- Hardware: DGX Spark GB10 (Blackwell, SM121) ×2
- Memory: 121 GB unified
- Base image: NGC PyTorch 26.03
- vLLM: 0.19.1 (a7d79fa, source build)
- FlashInfer: 0.6.7 (CUTLASS 4.4.2, SM121 source build)
- PyTorch: 2.11.0a0
- CUDA: 13.2
- Transformers: 5.5.0
- TurboQuant base: vLLM PR #38280
- CUDA WPH extension: AOT build, SM121, `BLOCK_D=128/256`
Summary of changes

1. Initial TurboQuant integration
- Ported PR #38280 onto vLLM 0.19.1
- Resolved a page-size mismatch with `_next_pow2(slot_bytes)` padding
- Increased KV cache capacity from 155K to 413K tokens
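As a rough sketch of the padding fix (function names are invented here; the actual vLLM patch differs): each TurboQuant slot size is rounded up to the next power of two so that quantized-attention pages and Mamba/GDN pages agree on a common page size.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def padded_slot_bytes(slot_bytes: int) -> int:
    # Pad the per-token slot so every layer type sees the same page size.
    return next_pow2(slot_bytes)
```

For example, a hypothetical 96-byte quantized slot pads to 128 bytes; even with this overhead the quantized cache still fits far more tokens than bf16 KV.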
2. Incremental decode
- Added dirty-block-only decode
- Added a full-decode fallback for the CUDA graph capture path
- Improved c=4 throughput from 31.2 to 36.3 t/s
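The capture-aware branching can be illustrated with a toy sketch. All names here (`decode_kv`, `dirty_blocks`, `capturing`) are invented, and the real code operates on GPU tensors rather than Python containers; the point is only the control flow.

```python
def decode_kv(blocks, dirty_blocks, capturing, decode_block):
    """Decode only blocks written since the last step, unless we are inside
    CUDA graph capture, where data-dependent control flow is not allowed."""
    if capturing:
        # Graph capture must trace a fixed amount of work: decode everything.
        targets = range(len(blocks))
    else:
        targets = sorted(dirty_blocks)
    out = {i: decode_block(blocks[i]) for i in targets}
    dirty_blocks.clear()
    return out
```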
3. Gather-free Triton decode
- Removed the `cache[flat_bt]` full memcpy/gather path
- Triton kernel now reads the paged cache directly
- Added an early exit for padding slots
- Fixed an odd-offset fp16 crash (`norm_offset=93`) with a byte-level `_safe_view_fp16`
- Improved c=4 throughput from 36.3 to 39.9 t/s
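The byte-level trick behind `_safe_view_fp16` can be shown with a stdlib-only analogue. The real fix works on torch tensors, but the idea is the same: slice bytes first, then reinterpret, so no 2-byte-aligned fp16 pointer is ever formed at an odd offset.

```python
import struct

def safe_view_fp16(buf: bytes, byte_offset: int, count: int):
    """Decode `count` little-endian fp16 values at an arbitrary byte offset."""
    raw = buf[byte_offset:byte_offset + 2 * count]  # plain byte copy, no alignment needed
    return list(struct.unpack(f"<{count}e", raw))
```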
4. CUDA WPH decode
- Implemented AOT-built CUDA warp-per-head (WPH) decode for SM121/aarch64
- Added gather-free direct paged-cache reads
- Added `BLOCK_D=128/256` template dispatch
- Moved the norm read into the CUDA kernel
- Added a 4-warp CTA configuration (128 threads)
- The current WPH path is slightly faster than Triton on Qwen3.5 at c=2 and c=4
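For intuition, here is a back-of-envelope launch configuration for the 4-warp CTA, under my assumption (not stated above) of one CTA per (sequence, KV head) pair:

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
WARPS_PER_CTA = 4    # the 4-warp configuration described above

def wph_launch_config(num_seqs: int, num_kv_heads: int):
    """Return (grid, block) sizes for a hypothetical WPH decode launch."""
    block = WARPS_PER_CTA * WARP_SIZE  # 128 threads = 4 warps per CTA
    grid = num_seqs * num_kv_heads     # one CTA per (sequence, KV head)
    return grid, block
```

Relative to the original 1-warp CTA, packing 4 warps per CTA gives the SM schedulers more eligible warps per block, which matches the occupancy fix noted in the issues table below.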
Main issues encountered

| Issue | Cause | Resolution |
|---|---|---|
| Page-size mismatch | TurboQuant slot size mismatch with Mamba/GDN layers | `_next_pow2()` padding |
| CUDA graph capture crash | Python-side CPU ops (`.tolist()`, `.unique()`) | capture-aware branching |
| WPH serving garbage output | output tensor mismatch under the old path | gather-free direct-read conversion |
| Odd `norm_offset` fp16 crash | misaligned fp16 view at offset 93 | byte-level `_safe_view_fp16` |
| `BLOCK_D=128` mismatch with head_dim=256 models | WPH hardcoded for 128 | templated `BLOCK_D=128/256` |
| Low occupancy with 1-warp CTA | insufficient scheduler utilization | 4-warp CTA |
| AOT build namespace error | `at::cuda` no longer valid in this path | `c10::cuda` |
| Nemotron int32 block_table incompatibility | int32 block table input | cast to int64 |
Results
Qwen3.5-122B-A10B-NVFP4 (TP1)
| Metric | bf16 KV | TQ Triton | TQ WPH v2 |
|---|---|---|---|
| tg32 c=1 | 17.0 t/s | 14.1 t/s | 14.0 t/s |
| tg32 c=2 | 33.3 t/s | 23.5 t/s | 24.5 t/s |
| tg32 c=4 | 55.2 t/s | 39.7 t/s | 40.6 t/s |
| KV cache | 155K | 405K | 405K |
| Korean QA sanity | — | 12/12 pass | 12/12 pass |
Interpretation:
- KV cache capacity increased from 155K to 405K tokens
- Initial TurboQuant c=4 throughput was 31.2 t/s
- Current WPH v2 c=4 throughput is 40.6 t/s
- That is roughly a 30% recovery relative to the initial TurboQuant state ((40.6 - 31.2) / 31.2 ≈ 0.30), while keeping the larger KV cache budget
Nemotron-H 120B-A12B-NVFP4 (TP1)
| Metric | TQ Triton | TQ WPH v2 |
|---|---|---|
| tg32 c=1 | 15.1 t/s | 15.3 t/s |
| KV cache | 1,548K | 1,423K |
| Compatibility | OK | OK |
Gemma 4 31B-it
Current status: not supported in this branch.
Reasons:
- heterogeneous head_dim (256 + 512)
- vLLM currently forces `TRITON_ATTN` for this model
- likely requires per-layer backend dispatch or `BLOCK_D=512` support
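A per-layer dispatch could look roughly like the sketch below (all names hypothetical): route each layer to the WPH path when its head_dim has a compiled template, and fall back to Triton otherwise.

```python
# Templates currently compiled into the WPH extension (per this branch).
SUPPORTED_BLOCK_D = {128, 256}

def pick_decode_backend(head_dim: int) -> str:
    """Choose a decode backend for one attention layer (illustrative only)."""
    if head_dim in SUPPORTED_BLOCK_D:
        return "wph_cuda"        # fast warp-per-head path
    return "triton_fallback"     # e.g. head_dim=512 until a BLOCK_D=512 template exists

# Gemma-style heterogeneous layout (made-up layer list).
layer_head_dims = [256, 256, 512, 256]
backends = [pick_decode_backend(d) for d in layer_head_dims]
```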
Regression test

Added: `scripts/test_wph_v2.py`

Current status:
- 6/6 PASS
- Covers `BLOCK_D=128/256`, outlier on/off, multi-head, and stress cases
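One plausible breakdown of the six cases, inferred only from the bullet above (2 `BLOCK_D` values × outlier on/off = 4 combinations, plus two extra scenarios); the real script may group them differently:

```python
from itertools import product

# 4 base combinations of template size and outlier handling.
base_cases = [
    {"block_d": d, "outliers": o} for d, o in product((128, 256), (False, True))
]
# Plus the two standalone scenarios mentioned above.
extra_cases = [{"scenario": "multi_head"}, {"scenario": "stress"}]
all_cases = base_cases + extra_cases  # 6 cases total
```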
Current status

At this point:
- TurboQuant integration is functional on DGX Spark GB10
- Gather-free Triton decode is stable
- CUDA WPH direct-read decode is functional in serving
- The 4-warp CTA is currently the best-performing WPH configuration
- On Qwen3.5, WPH is slightly faster than Triton at c=2 and c=4
- The main practical gain remains the increased KV cache budget
Relevant commits
- `7fe59cf`: TurboQuant full implementation
- `bf4ca06`: branch-specific README / development history
- `f1e3239`: WPH int32 fix + Nemotron/Gemma4 presets
Remaining work

Main items not addressed yet:
- Better support for heterogeneous head_dim models
- Additional CTA/layout experiments beyond the current 4-warp configuration
- Possible cleanup for upstreaming the gather-free WPH path independently of the full branch
If useful, I can post a follow-up with implementation details or a phase-by-phase patch breakdown.