vLLM v0.8.4 shows UVM GPU1 BH process with high utilization

Robert_Crovella · April 23, 2025, 6:25pm

The implementation doesn’t appear to be making use of managed memory (where page faults might occur):

I try to re-implement CPU offloading in a fully transparent way: we offload the tensor to CPU, and let GPU directly view it as GPU tensor. It depends on UVA technology (no clear documentation, but there’re some public discussions), and per my discussion with nvidia experts, it works for systems with pinned memory.

I don’t have any info on the UVM GPU1 BH process, but it doesn’t appear to be unique to anything you’ve mentioned.
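
For reference, here is a minimal sketch of the pinned-memory, zero-copy pattern that the quoted description appears to rely on. It is not taken from the vLLM source. The kernel name `scale`, the buffer size, and the `CHECK` macro are illustrative assumptions. The point is that a pinned, mapped host allocation is reported as host memory rather than managed memory, so the GPU reads it directly over the interconnect and no UVM page migration is involved.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro for this sketch: report the error and exit main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Illustrative kernel: operates in place on whatever pointer it is given.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for the sketch
    float *host = nullptr;

    // Pinned (page-locked) host memory, mapped into the device address space.
    CHECK(cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Ask the runtime what kind of memory this is. Pinned host memory should
    // report cudaMemoryTypeHost, not cudaMemoryTypeManaged.
    cudaPointerAttributes attr;
    CHECK(cudaPointerGetAttributes(&attr, host));
    printf("managed: %s\n", attr.type == cudaMemoryTypeManaged ? "yes" : "no");

    // Device-side alias of the same allocation. With UVA this is typically
    // the same address as the host pointer.
    float *dev = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&dev, host, 0));

    // The kernel accesses host memory directly; nothing is copied or migrated.
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // The result is visible through the host pointer after synchronization.
    printf("host[0] = %f\n", host[0]);

    CHECK(cudaFreeHost(host));
    return 0;
}
```

If a tensor is set up this way, I would not expect demand-paging activity from it, which is why the quoted approach does not obviously explain a busy UVM bottom-half thread.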
