I spent a night benchmarking KV cache quantization on my DGX Spark running Nemotron-3-Nano-30B-A3B with 128K context via llama.cpp. The results surprised me - sharing them here so others don’t have to repeat the experiment.
Setup:
- DGX Spark (GB10, compute capability 12.1, 128 GB unified memory, CUDA 13.0, driver 580.126.09)
- llama.cpp build 8399 (aarch64 + CUDA), compiled from source
- Nemotron-3-Nano-30B-A3B (Q4_K_XL GGUF)
- 4 llama.cpp servers running simultaneously (Nemotron 30B, Qwen 3.5-9B, Gemma 3-12B, nomic-embed)
- Tested --cache-type-k / --cache-type-v with f16 (default), q8_0, and q4_0
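Every run used the same launch with only the cache-type flags swapped. Roughly like this - a reconstructed sketch, not my exact command line (model path and port are placeholders, and the -fa syntax differs across builds; note llama.cpp requires flash attention to quantize the V cache):

```bash
# Sketch of one run; swap f16 -> q8_0 or q4_0 on both cache flags per test.
# Flash attention must be enabled for a quantized V cache in llama.cpp.
llama-server \
  -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -c 65536 \
  -fa on \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --port 8080
```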
Prompt Processing Speed (tok/s):
| Context | f16 | q4_0 | q4_0 vs f16 |
|---|---|---|---|
| ~8K | 371.3 | 363.4 | -2.1% |
| ~16K | 360.7 | 346.2 | -4.0% |
| ~32K | 328.3 | 316.9 | -3.5% |
| ~64K | 282.7 | 21.3 | -92.5% |
Generation Speed (tok/s):
| Context | f16 | q4_0 | Delta |
|---|---|---|---|
| ~8K | 14.7 | 14.2 | -3.4% |
| ~16K | 13.9 | 12.7 | -8.6% |
| ~32K | 13.5 | 11.0 | -18.5% |
| ~64K | 13.3 | 8.6 | -35.3% |
Memory (RSS):
| Context | f16 | q4_0 |
|---|---|---|
| ~8K | 1.25 GB | 1.34 GB (+7%) |
| ~32K | 1.59 GB | 1.69 GB (+6%) |
| ~64K | 1.94 GB | 2.06 GB (+6%) |
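If you want to reproduce the speed tables without standing up servers, llama-bench gives comparable pp/tg numbers - a sketch (flag syntax can differ across builds, and llama-bench takes the cross product of comma-separated values, so trim the combinations you don't care about):

```bash
# Sketch: measure pp/tg at several KV depths for each cache type.
# -d pre-fills the KV cache to that depth before measuring;
# -p/-n are the prompt-processing and generation sizes.
llama-bench \
  -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -fa 1 \
  -ctk f16,q8_0,q4_0 \
  -ctv f16,q8_0,q4_0 \
  -d 8192,16384,32768,65536 \
  -p 512 -n 128
```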
Key findings:
- q4_0 is ~92% slower at 64K context - dequantization overhead destroys prompt processing. The HuggingFace team warned about this.
- q4_0 actually used MORE memory than f16 in these runs. That can't be the cache tensors themselves - a q4_0 block packs 32 values into 18 bytes (a 2-byte f16 scale plus 16 bytes of nibbles, ~4.5 bits/value), nominally ~3.5x smaller than f16 - so the extra RSS presumably comes from the quantized-cache code path (e.g. dequantization buffers) rather than metadata overhead.
- q8_0 is the only quantization worth running - ~2x compression, <5% speed hit at all context lengths.
- f16 is fine for most workloads. At 64K tokens the KV cache is under 2 GB out of 128 GB available. There's no memory pressure to solve.
Recommendations for Spark users:
- General chat (<16K ctx): f16 (default)
- Long context (16-64K): --cache-type-k q8_0 --cache-type-v q8_0 (example launch below)
- Very long context (64K+): wait for TurboQuant or switch to TRT-LLM + NVFP4
- Maximum throughput: TRT-LLM + NVFP4 (hardware-accelerated, no dequant penalty)
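Concretely, the 16-64K recommendation is just the launch from the setup section with the cache flags switched:

```bash
# Sketch: long-context launch with a q8_0 KV cache (keep flash attention on).
llama-server -m /models/Nemotron-3-Nano-30B-A3B-Q4_K_XL.gguf \
  -c 65536 -fa on --cache-type-k q8_0 --cache-type-v q8_0
```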
What’s next: I’m building from the spiritbuun TurboQuant CUDA fork targeting sm_121. TurboQuant claims no dequantization overhead by design - if that holds up on Blackwell, the results could be dramatically different. Will post those benchmarks when I have them.
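The build itself should be the standard llama.cpp CUDA build pointed at the fork - a sketch, with the repo URL as a placeholder (sm_121 maps to CMAKE_CUDA_ARCHITECTURES=121):

```bash
# Sketch: CUDA build targeting GB10 (sm_121). The fork URL is a placeholder.
git clone https://github.com/spiritbuun/llama.cpp turboquant
cd turboquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
cmake --build build --config Release -j
```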
Full writeup with methodology: I Benchmarked KV Cache Quantization on My DGX Spark — Here’s Why I Went Back to f16
If you’re running similar tests on your Spark, drop your numbers - the community needs more real-world data.