Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

jwitsoe · December 8, 2025, 5:00pm

Originally published at: Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache | NVIDIA Technical Blog

Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and KV cache, we can reduce the memory footprint and compute cost—directly improving throughput, latency, and achievable context length. This blog introduces NVFP4 KV cache quantization, a new KV format that enables significant performance gains on NVIDIA Blackwell…
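To make the memory-footprint claim concrete, here is a rough back-of-the-envelope sketch of KV cache size at different precisions. The model shape (layers, KV heads, head dimension), sequence length, and batch size are hypothetical values chosen for illustration, not figures from the blog. The NVFP4 estimate assumes 4-bit values plus one 8-bit scale per 16-element block (about 4.5 bits per element) and ignores the small per-tensor scale.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bits_per_element):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # Factor of 2: one tensor for keys and one for values.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element / 8


# Hypothetical model and workload shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128,
              seq_len=131072, batch_size=8)

# Effective storage cost per cached element, in bits.
FORMATS = {
    "FP16": 16.0,
    "FP8": 8.0,
    # Assumption: 4-bit values plus one 8-bit scale shared by 16 elements.
    "NVFP4": 4.0 + 8.0 / 16,
}

sizes_gib = {
    name: kv_cache_bytes(bits_per_element=bits, **config) / 2**30
    for name, bits in FORMATS.items()
}

for name, gib in sizes_gib.items():
    print(f"{name:>5}: {gib:6.1f} GiB")
```

Under these assumptions the cache shrinks from 128 GiB at FP16 to 64 GiB at FP8 and 36 GiB at NVFP4, so the same memory budget can hold a longer context or a larger batch.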

Related topics

Topic	Category	Tags	Replies	Views	Activity
Introducing NVFP4 for Efficient and Accurate Low-Precision Inference	Technical Blog	-	4	421	January 26, 2026
Mastering LLM Techniques: Inference Optimization	Technical Blog	-	0	517	November 17, 2023
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo	Technical Blog	deepseek	1	81	September 18, 2025
Scaling NVFP4 Inference for FLUX.2 on NVIDIA Blackwell Data Center GPUs	Technical Blog	-	0	42	January 22, 2026
NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance	Technical Blog	-	3	221	July 17, 2025
NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit	Technical Blog	-	1	124	August 25, 2025
From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f	DGX Spark / GB10	-	10	1231	January 7, 2026
Custom FP4 CUDA Kernel - 129 TFLOPS on DGX Spark with Pre-Quantized Weight Cache	CUDA Programming and Performance	cublas	4	182	February 25, 2026
NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200	Technical Blog	llama	2	84	November 27, 2024
PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM	DGX Spark / GB10	-	16	1467	February 4, 2026