DGX Spark GB10 / vLLM 0.19.1: TurboQuant KV cache integration results on Qwen3.5 and Nemotron, including gather-free Triton decode and CUDA WPH decode

This post summarizes the current state of my TurboQuant work on DGX Spark GB10 (Blackwell, SM121) with a patched vLLM 0.19.1 stack.

Repository:
https://github.com/bjk110/spark_vllm_docker/tree/feat/turboquant

Environment

  • Hardware: DGX Spark GB10 (Blackwell, SM121) ×2

  • Memory: 121 GB unified

  • Base image: NGC PyTorch 26.03

  • vLLM: 0.19.1 (a7d79fa, source build)

  • FlashInfer: 0.6.7 (CUTLASS 4.4.2, SM121 source build)

  • PyTorch: 2.11.0a0

  • CUDA: 13.2

  • Transformers: 5.5.0

  • TurboQuant base: vLLM PR #38280

  • CUDA WPH extension: AOT build, SM121, BLOCK_D=128/256

Summary of changes

1. Initial TurboQuant integration

  • Ported PR #38280 onto vLLM 0.19.1

  • Resolved page-size mismatch with _next_pow2(slot_bytes) padding

  • Increased KV cache capacity from 155K to 413K tokens

2. Incremental decode

  • Added dirty-block-only decode

  • Added full decode fallback for CUDA graph capture path

  • Improved c=4 throughput from 31.2 to 36.3 t/s
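The shape of the capture-aware branching, sketched with hypothetical names: in eager mode only dirty blocks are decoded, but during CUDA graph capture data-dependent CPU work (`.tolist()`, `.unique()` on device tensors) would abort capture, so the path falls back to decoding everything with a fixed shape:

```python
def blocks_to_decode(num_blocks, dirty_block_ids, capturing):
    """Return the block ids the dequant step should touch.

    capturing=True models the CUDA-graph capture path, where CPU-side
    ops on device tensors are illegal, so every block is decoded with a
    capture-safe, data-independent shape.
    """
    if capturing:
        return list(range(num_blocks))        # full-decode fallback
    return sorted(set(dirty_block_ids))       # dirty-block-only decode
```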

3. Gather-free Triton decode

  • Removed cache[flat_bt] full memcpy/gather path

  • Triton kernel now reads paged cache directly

  • Added early exit for padding slots

  • Fixed odd-offset fp16 crash (norm_offset=93) using byte-level _safe_view_fp16

  • Improved c=4 throughput from 36.3 to 39.9 t/s

4. CUDA WPH decode

  • Implemented AOT-built CUDA warp-per-head decode for SM121/aarch64

  • Added gather-free direct paged-cache read

  • Added BLOCK_D=128/256 template dispatch

  • Moved norm read into the CUDA kernel

  • Added 4-warp CTA configuration (128 threads)

  • Current WPH path is slightly faster than Triton on Qwen3.5 at c=2 and c=4
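The BLOCK_D dispatch on the Python side reduces to picking the smallest compiled template that covers the model's head_dim; kernel names here are hypothetical (the AOT build ships one instantiation per BLOCK_D):

```python
# One AOT-compiled instantiation per supported BLOCK_D (names illustrative).
_WPH_KERNELS = {128: "wph_decode_block_d128", 256: "wph_decode_block_d256"}

def select_wph_kernel(head_dim: int) -> str:
    """Pick the smallest BLOCK_D template that covers head_dim:
    128 for Qwen3.5-style heads, 256 for Nemotron-style heads."""
    for block_d in sorted(_WPH_KERNELS):
        if head_dim <= block_d:
            return _WPH_KERNELS[block_d]
    raise ValueError(f"no WPH instantiation covers head_dim={head_dim}")
```

head_dim=512 falls through to the error, which lines up with the Gemma limitation noted further down.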

Main issues encountered

| Issue | Cause | Resolution |
|---|---|---|
| Page-size mismatch | TurboQuant slot size differs from the Mamba/GDN layers' | `_next_pow2()` padding |
| CUDA graph capture crash | Python-side CPU ops (`.tolist()`, `.unique()`) during capture | capture-aware branching |
| Garbage WPH output in serving | output tensor mismatch under the old gather path | gather-free direct-read conversion |
| fp16 crash at odd `norm_offset` | misaligned fp16 view at offset 93 | byte-level `_safe_view_fp16` |
| `BLOCK_D=128` mismatch on head_dim=256 models | WPH hardcoded for 128 | templated `BLOCK_D=128/256` |
| Low occupancy with 1-warp CTA | insufficient scheduler utilization | 4-warp CTA |
| AOT build namespace error | `at::cuda` no longer valid in this path | `c10::cuda` |
| Nemotron int32 `block_table` incompatibility | int32 block-table input | cast to int64 |
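The Nemotron fix in the last row is just a dtype widening before the kernel launch. A NumPy sketch of the idea (the real code casts the device-side block table):

```python
import numpy as np

def normalize_block_table(block_table: np.ndarray) -> np.ndarray:
    """Widen an int32 block table to int64.

    The WPH kernel computes page offsets with 64-bit indices; Nemotron
    hands in an int32 table, so cast it up before the launch.
    """
    if block_table.dtype == np.int32:
        return block_table.astype(np.int64)
    return block_table
```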

Results

Qwen3.5-122B-A10B-NVFP4 (TP1)

| Metric | bf16 KV | TQ Triton | TQ WPH v2 |
|---|---|---|---|
| tg32 c=1 | 17.0 t/s | 14.1 t/s | 14.0 t/s |
| tg32 c=2 | 33.3 t/s | 23.5 t/s | 24.5 t/s |
| tg32 c=4 | 55.2 t/s | 39.7 t/s | 40.6 t/s |
| KV cache | 155K | 405K | 405K |
| Korean QA sanity | | 12/12 pass | 12/12 pass |

Interpretation:

  • KV cache increased from 155K to 405K

  • Initial TurboQuant c=4 throughput was 31.2 t/s

  • Current WPH v2 c=4 throughput is 40.6 t/s

  • This is approximately a 30% recovery relative to the initial TurboQuant state, while keeping the larger KV cache budget

Nemotron-H 120B-A12B-NVFP4 (TP1)

| Metric | TQ Triton | TQ WPH v2 |
|---|---|---|
| tg32 c=1 | 15.1 t/s | 15.3 t/s |
| KV cache | 1,548K | 1,423K |
| Compatibility | OK | OK |

Gemma 4 31B-it

Current status: not supported in this branch.

Reason:

  • heterogeneous head_dim (256 + 512)

  • vLLM currently forces the TRITON_ATTN backend for this model

  • likely requires per-layer backend dispatch or BLOCK_D=512 support

Regression test

Added:

  • scripts/test_wph_v2.py

Current status:

  • 6/6 PASS

  • covers BLOCK_D=128/256, outlier on/off, multi-head, and stress cases

Current status

At this point:

  • TurboQuant integration is functional on DGX Spark GB10

  • gather-free Triton decode is stable

  • CUDA WPH direct-read decode is functional in serving

  • 4-warp CTA is currently the best-performing WPH configuration

  • On Qwen3.5, WPH is slightly faster than Triton at c=2 and c=4

  • The main practical gain remains the increased KV cache budget

Relevant commits

  • 7fe59cf — TurboQuant full implementation

  • bf4ca06 — branch-specific README / development history

  • f1e3239 — WPH int32 fix + Nemotron/Gemma4 presets

Remaining work

Main items not addressed yet:

  1. Better support for heterogeneous head_dim models

  2. Additional CTA/layout experiments beyond the current 4-warp configuration

  3. Possible cleanup for upstreaming the gather-free WPH path independently of the full branch

If useful, I can post a follow-up with implementation details or a phase-by-phase patch breakdown.


Outstanding! Can’t wait to test it; past bedtime here :) Thank you for sharing all your insights and the repo.