I spent a night benchmarking KV cache quantization on my DGX Spark running Nemotron-3-Nano-30B-A3B with 128K context via llama.cpp. The results surprised me - sharing them here so others don’t have to repeat the experiment.
Setup:
- DGX Spark (GB10, compute capability 12.1, 128 GB unified memory, CUDA 13.0, driver 580.126.09)
- llama.cpp build 8399 (aarch64 + CUDA), compiled from source
- Nemotron-3-Nano-30B-A3B (Q4_K_XL GGUF)
- 4 llama.cpp servers running simultaneously (Nemotron 30B, Qwen 3.5-9B, Gemma 3-12B, nomic-embed)
- Tested --cache-type-k / --cache-type-v with f16 (default), q8_0, and q4_0
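Every run used the same launch with only the cache-type flags swapped. Roughly like this - a reconstructed sketch, not my exact command line (model path and port are placeholders, and the -fa syntax differs across builds; note llama.cpp requires flash attention to quantize the V cache):

```bash
# Sketch of one run; swap f16 -> q8_0 or q4_0 on both cache flags per test.
# Flash attention must be enabled for a quantized V cache in llama.cpp.
llama-server \
  -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -c 65536 \
  -fa on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --port 8080
```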
Prompt Processing Speed (tok/s):
| Context | f16 | q4_0 | q4_0 vs f16 |
|---|---|---|---|
| ~8K | 371.3 | 363.4 | -2.1% |
| ~16K | 360.7 | 346.2 | -4.0% |
| ~32K | 328.3 | 316.9 | -3.5% |
| ~64K | 282.7 | 21.3 | -92.5% |
Generation Speed (tok/s):
| Context | f16 | q4_0 | Delta |
|---|---|---|---|
| ~8K | 14.7 | 14.2 | -3.4% |
| ~16K | 13.9 | 12.7 | -8.6% |
| ~32K | 13.5 | 11.0 | -18.5% |
| ~64K | 13.3 | 8.6 | -35.3% |
Memory (RSS):
| Context | f16 | q4_0 |
|---|---|---|
| ~8K | 1.25 GB | 1.34 GB (+7%) |
| ~32K | 1.59 GB | 1.69 GB (+6%) |
| ~64K | 1.94 GB | 2.06 GB (+6%) |
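If you want to reproduce the speed tables without standing up servers, llama-bench gives comparable pp/tg numbers - a sketch (flag syntax can differ across builds, and llama-bench takes the cross product of comma-separated values, so trim the combinations you don't care about):

```bash
# Sketch: measure pp/tg at several KV depths for each cache type.
# -d pre-fills the KV cache to that depth before measuring;
# -p/-n are the prompt-processing and generation sizes.
llama-bench \
  -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -fa 1 \
  -ctk f16,q8_0,q4_0 \
  -ctv f16,q8_0,q4_0 \
  -d 8192,16384,32768,65536 \
  -p 512 -n 128
```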
Key findings:
- q4_0 is ~92% slower at 64K context - dequantization overhead destroys prompt processing. The HuggingFace team warned about this.
- q4_0 actually used MORE memory than f16 in these runs. That can't be the cache tensors themselves - a q4_0 block packs 32 values into 18 bytes (a 2-byte f16 scale plus 16 bytes of nibbles, ~4.5 bits/value), nominally ~3.5x smaller than f16 - so the extra RSS presumably comes from the quantized-cache code path (e.g. dequantization buffers) rather than metadata overhead.
- q8_0 is the only quantization worth running - ~2x compression, <5% speed hit at all context lengths.
- f16 is fine for most workloads. At 64K tokens the KV cache is under 2 GB out of 128 GB available. There's no memory pressure to solve.
Recommendations for Spark users:
- General chat (<16K ctx): f16 (default)
- Long context (16-64K): --cache-type-k q8_0 --cache-type-v q8_0 (example launch below)
- Very long context (64K+): wait for TurboQuant or switch to TRT-LLM + NVFP4
- Maximum throughput: TRT-LLM + NVFP4 (hardware-accelerated, no dequant penalty)
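Concretely, the 16-64K recommendation is just the launch from the setup section with the cache flags switched:

```bash
# Sketch: long-context launch with a q8_0 KV cache (keep flash attention on).
llama-server -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -c 65536 -fa on --cache-type-k q8_0 --cache-type-v q8_0
```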
What’s next: I’m building from the spiritbuun TurboQuant CUDA fork targeting sm_121. TurboQuant claims no dequantization overhead by design - if that holds up on Blackwell, the results could be dramatically different. Will post those benchmarks when I have them.
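The build itself should be the standard llama.cpp CUDA build pointed at the fork - a sketch, with the repo URL as a placeholder (sm_121 maps to CMAKE_CUDA_ARCHITECTURES=121):

```bash
# Sketch: CUDA build targeting GB10 (sm_121). The fork URL is a placeholder.
git clone https://github.com/spiritbuun/llama.cpp turboquant
cd turboquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
cmake --build build --config Release -j
```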
Full writeup with methodology: I Benchmarked KV Cache Quantization on My DGX Spark — Here’s Why I Went Back to f16
If you’re running similar tests on your Spark, drop your numbers - the community needs more real-world data.