KV Cache Quantization Benchmarks on DGX Spark — q4_0 vs q8_0 vs f16 (llama.cpp, Nemotron 30B, 128K context)

I spent a night benchmarking KV cache quantization on my DGX Spark running Nemotron-3-Nano-30B-A3B with 128K context via llama.cpp. The results surprised me - sharing them here so others don’t have to repeat the experiment.

Setup:
- DGX Spark (GB10, compute capability 12.1, 128 GB unified memory, CUDA 13.0, driver 580.126.09)
- llama.cpp build 8399 (aarch64 + CUDA), compiled from source
- Nemotron-3-Nano-30B-A3B (Q4_K_XL GGUF)
- 4 llama.cpp servers running simultaneously (Nemotron 30B, Qwen 3.5-9B, Gemma 3-12B, nomic-embed)
- Tested --cache-type-k / --cache-type-v with f16 (default), q8_0, and q4_0
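
For concreteness, one of the quantized-cache launches looked roughly like this. This is a sketch, not my exact command - the model path and port are placeholders. Note that llama.cpp needs flash attention enabled to quantize the V cache; recent builds take `-fa on|off|auto`, older ones a bare `-fa`.

```bash
# Rough launch line for a q8_0-cache run (paths/ports are placeholders).
# -c 131072 : 128K context
# -ngl 99   : offload all layers to the GPU
# -fa on    : flash attention, required for a quantized V cache
./build/bin/llama-server \
  -m models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -c 131072 -ngl 99 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8080
```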

Prompt Processing Speed (tok/s):

| Context | f16   | q4_0  | Delta vs f16 |
|---------|-------|-------|--------------|
| ~8K     | 371.3 | 363.4 | -2.1%        |
| ~16K    | 360.7 | 346.2 | -4.0%        |
| ~32K    | 328.3 | 316.9 | -3.5%        |
| ~64K    | 282.7 | 21.3  | -92.5%       |

Generation Speed (tok/s):

| Context | f16  | q4_0 | Delta vs f16 |
|---------|------|------|--------------|
| ~8K     | 14.7 | 14.2 | -3.4%        |
| ~16K    | 13.9 | 12.7 | -8.6%        |
| ~32K    | 13.5 | 11.0 | -18.5%       |
| ~64K    | 13.3 | 8.6  | -35.3%       |

Memory (RSS):

| Context | f16     | q4_0    | Delta |
|---------|---------|---------|-------|
| ~8K     | 1.25 GB | 1.34 GB | +7%   |
| ~32K    | 1.59 GB | 1.69 GB | +6%   |
| ~64K    | 1.94 GB | 2.06 GB | +6%   |
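
If you want to reproduce the sweep, llama-bench covers most of it in one invocation. A sketch, not necessarily the exact methodology behind my numbers: `-p` times prompt processing at each size, and `-pg` processes a 64K prompt then times 128 generated tokens at that depth.

```bash
# Sweep prompt-processing and depth-64K generation speed for one cache type.
# Rerun with -ctk/-ctv set to f16, q8_0, and q4_0 and compare the rows.
./build/bin/llama-bench \
  -m models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -fa 1 -ctk q4_0 -ctv q4_0 \
  -p 8192,16384,32768,65536 \
  -pg 65536,128
```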

Key findings:

  1. q4_0 is 92% slower at 64K context - prompt processing collapses from 282.7 to 21.3 tok/s as dequantization overhead takes over. The Hugging Face team warned about exactly this.

  2. q4_0 actually uses MORE memory than f16 - per-block scale metadata overhead exceeds the compression savings on Spark’s unified memory architecture.

  3. q8_0 is the only quantization worth running - 2x compression, <5% speed hit at all context lengths.

  4. f16 is fine for most workloads. At 64K tokens the KV cache is under 2 GB out of 128 GB available, so there’s no memory pressure to solve (see the sizing sketch after this list).
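
Here's the back-of-envelope math for finding 4. Every number below is an illustrative placeholder, not Nemotron's real config - pull n_layer, n_head_kv, and head_dim from the GGUF metadata or the llama-server startup log before trusting the result.

```bash
# f16 KV cache size for a plain GQA transformer (placeholder values).
N_CTX=65536 N_LAYER=8 N_KV_HEADS=8 HEAD_DIM=128 BYTES_F16=2
KV_BYTES=$(( 2 * N_CTX * N_LAYER * N_KV_HEADS * HEAD_DIM * BYTES_F16 ))  # leading 2 = K and V
echo "scale=2; $KV_BYTES / 2^30" | bc   # 2.00 GiB with these placeholders
```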

Recommendations for Spark users:
- General chat (<16K ctx): f16 (default)
- Long context (16-64K): --cache-type-k q8_0 --cache-type-v q8_0
- Very long context (64K+): wait for TurboQuant or switch to TRT-LLM + NVFP4
- Maximum throughput: TRT-LLM + NVFP4 (hardware-accelerated, no dequant penalty)

What’s next: I’m building from the spiritbuun TurboQuant CUDA fork targeting sm_121. TurboQuant claims no dequantization overhead by design - if that holds up on Blackwell, the results could be dramatically different. Will post those benchmarks when I have them.
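
The build itself is nothing exotic, assuming the fork keeps upstream llama.cpp's CMake layout (these are the standard upstream flags; the directory name is hypothetical):

```bash
cd turboquant-llama.cpp   # hypothetical directory name for the checked-out fork
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=121 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```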

Full writeup with methodology: “I Benchmarked KV Cache Quantization on My DGX Spark — Here’s Why I Went Back to f16”

If you’re running similar tests on your Spark, drop your numbers - the community needs more real-world data.