Originally published at: Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing | NVIDIA Technical Blog
Large Language Models (LLMs) are at the forefront of AI innovation, but their massive size can complicate inference efficiency. Models such as Llama 3 70B and Llama 4 Scout 109B may require more memory than is available on the GPU, especially when using large context windows. For example, loading Llama 3 70B and Llama 4…
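For context, the CPU-GPU memory sharing the post's title refers to can be enabled through RMM's managed-memory allocator, which backs allocations with cudaMallocManaged so that pages can spill to host memory when GPU memory is oversubscribed. The snippet below is only a minimal sketch under that assumption, not the post's exact code; the tensor size is a placeholder.

```python
# Minimal sketch (not the blog's exact code): route PyTorch CUDA allocations
# through RMM managed memory so a model or KV cache larger than GPU memory
# can still be allocated, with pages migrating between host and device.
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Back RMM allocations with cudaMallocManaged (CPU-GPU shared memory).
rmm.reinitialize(managed_memory=True)

# Must run before any CUDA allocation so PyTorch uses RMM from the start.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Allocations may now exceed free GPU memory; pages that do not fit
# spill to host memory and migrate on demand (illustrative size only).
kv_cache = torch.empty(48 * 1024**3, dtype=torch.float16, device="cuda")
```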
Is there any ready-made Docker container that can leverage this to serve LLMs with an OpenAI-compatible API, without having to re-implement the code shown in the blog post plus the web server, metrics, etc.? Would the --cpu-offload-gb parameter in vLLM be equivalent, or is some performance left on the table by using it instead of RMM? Thanks
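For reference, the vLLM option asked about above is used roughly as follows; this is a sketch assuming the vLLM Python API, with a placeholder model and offload size, not a statement that it matches the blog's RMM approach.

```python
# Rough sketch of the vLLM alternative referenced above (assumed usage,
# placeholder model and offload size): cpu_offload_gb reserves host memory
# per GPU to hold part of the model weights, rather than relying on
# managed-memory paging as with RMM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", cpu_offload_gb=40)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```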