Per-process GPU memory overhead

We have an application with a CUDA-optimized machine-learning component that has worked well in medium-scale deployment for several years. As we scale up now to larger deployments, we are having a problem with per-process GPU memory overhead.

The problem is that each Linux process attached to a GPU consumes more than 120 megabytes of that GPU’s memory in driver overhead alone, before our application allocates anything. We observe roughly the same overhead with CUDA 3.1, 3.2, and 4.0. Is this overhead intentional and expected, or is it a bug or misconfiguration on our part?
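Roughly the kind of probe we use to see this (trimmed down to its essentials; the file and variable names are just illustrative): it forces context creation with cudaFree(0), prints the cudaMemGetInfo figures, and then blocks so the context stays resident while further instances are started. Watching the free figure (or nvidia-smi) as each instance comes up shows the per-process cost.

```cpp
// ctx_overhead_probe.cu -- illustrative name, not our real tool.
// Forces CUDA context creation, reports free/total device memory,
// then waits so the context stays alive while more copies are launched.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // device index, default 0
    cudaSetDevice(dev);

    // cudaFree(0) is the usual idiom to force lazy context creation
    // without allocating anything of our own.
    cudaFree(0);

    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("device %d: free %zu MB / total %zu MB after context creation\n",
           dev, freeB >> 20, totalB >> 20);

    // Block here so the context (and its overhead) stays resident
    // while additional instances of this probe are started.
    getchar();
    return 0;
}
```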

For background, our machine-learning component runs in N processes, one per available CPU core, each launching CUDA kernels as needed on the GPU attached to that process/core. This architecture balances the load between CPU and GPU resources well. Our CUDA kernels are simple and fast and use relatively little memory (ten megabytes or so), with no special sharing or concurrency features required.
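To make the pattern concrete, here is a stripped-down sketch of what each worker process does; the kernel body, buffer sizes, and names are placeholders rather than our real code:

```cpp
// worker.cu -- illustrative sketch of one worker process.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void score(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;        // stand-in for the real kernel
}

int main(int argc, char** argv)
{
    int workerId = (argc > 1) ? atoi(argv[1]) : 0;

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(workerId % deviceCount);   // pin this process to one GPU

    const int n = 1 << 20;                   // two 4 MB buffers, about our usual footprint
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemset(dIn, 0, n * sizeof(float));

    // In production this runs in a loop driven by the CPU-side workload.
    score<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();                 // cudaThreadSynchronize() on 3.x

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```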

Our “medium-scale” deployments typically use servers with 4-8 processor cores and one or two 2 GB Fermi cards. Now that demands are higher and CPUs are denser, we’re moving to servers with 24-96 cores, and these configurations with 24-96 processes require 3 to 12 gigabytes of GPU memory (roughly 125 MB per process) before the first application byte is allocated. That’s too much.

My guess is that one CUDA context is created per process on its selected GPU. Each CUDA context has a virtual address space (unified across all GPU memories under 4.0 UVA) tied to exactly one Linux process address space, and each context carries a large overhead of GPU-resident data structures. Is it correct that a context can be shared between CPU threads that share the same address space, but not between CPU threads in different address spaces? Does it matter whether we use the runtime (C) API or the driver API?
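In driver-API terms, the sharing we are asking about would look like the sketch below (our assumption of the 4.0 behavior, using cuCtxSetCurrent): one context created per process, made current on several threads of that same process. Separate processes, as far as we can tell, cannot share a context this way, which is exactly what we would like confirmed.

```cpp
// shared_ctx.cu -- illustrative sketch of one context shared by threads
// of a single process; not something separate processes can do.
#include <cstdio>
#include <pthread.h>
#include <cuda.h>

static CUcontext ctx;            // the only context this process creates

static void* worker(void*)
{
    // CUDA 4.0: any thread in this process can bind the shared context.
    cuCtxSetCurrent(ctx);

    CUdeviceptr p;
    cuMemAlloc(&p, 1 << 20);     // allocations land in the shared context
    cuMemFree(p);
    return NULL;
}

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // one context per process, not per thread

    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);

    cuCtxDestroy(ctx);
    return 0;
}
```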

If this is true, the recommendation that CUDA contexts are analogous to CPU processes is very misleading and should be revised to recommend using as few contexts as possible. Is this in fact the intended direction?

Thank you.