Optimizing Time to First Token with Fine-Grained KV Cache Blocks, Real-time Reuse, and Efficient Eviction Algorithms

Originally published at: Optimizing Time to First Token with Fine-Grained KV Cache Blocks, Real-time Reuse, and Efficient Eviction Algorithms | NVIDIA Technical Blog

In our previous blog post, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can speed up time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and by up to 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques…