Per-process GPU memory overhead

khsinclair · July 29, 2011, 2:43pm

We have an application with a CUDA-optimized machine-learning component that has worked well in medium-scale deployment for several years. As we scale up now to larger deployments, we are having a problem with per-process GPU memory overhead.

The problem is that each Linux process attached to a GPU consumes more than 120 megabytes of memory on that GPU, purely in driver overhead, before our application allocates anything. We observe roughly the same overhead in CUDA releases 3.1, 3.2, and 4.0. Is this overhead intentional and expected, or is it some bug or misconfiguration on our part?

For background, our machine-learning component runs in N processes, one for each CPU core available, launching CUDA kernels as needed on a GPU attached to that process/core. This architecture does a good job of balancing the load between CPU and GPU resources. Our CUDA kernels are simple and fast, using relatively little memory (ten megabytes or so), with no special sharing or concurrency features required.

Our “medium-scale” deployments would typically use servers with 4-8 processor cores and one or two 2 GB Fermi cards. Now that demands are higher and CPUs are denser, we’re moving to servers with 24-96 cores, and these configurations with 24-96 processes require 3 to 12 gigabytes of GPU memory before the first application byte is allocated. That’s too much.

My guess is that one CUDA context is created per process on its selected GPU. Each CUDA context has a virtual address space (UVA in 4.0) encompassing all the GPU memories and exactly one Linux process address space, and each CUDA context carries a large overhead of GPU-resident data structures. Is it correct that a context can be shared between CPU threads that share the same address space, but not between CPU threads in different address spaces? Does it matter whether we’re using the runtime (C) API or the driver API?

If this is true, the guidance that CUDA contexts are analogous to CPU processes is very misleading, and should be revised to recommend using as few contexts as possible. Is this in fact the intended direction?

Thank you.
