Huge Device Memory Overheads: device memory loss with each new process

I am working on a project where several processes each want to use the same GPU under CUDA 4.0. I’ve found that there is a huge loss of device memory (due to overheads?) with every new process. By default, I lose about 202.75 MB of device memory per process. I can play with cudaDeviceSetLimit() to force the stack, heap, and printf FIFO sizes to 0, which reduces the loss to ~181.6 MB, but of course that is an impractical “solution.” I suppose if only one process were using the GPU this would be a somewhat acceptable loss, but as it stands I can only create a handful of processes before my GTX 470 runs out of memory, and that’s before I’ve even launched a kernel!
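For concreteness, here is a minimal sketch of the kind of measurement involved (this is illustrative, not my actual project code; error checking is omitted):

    // Shrink the per-context allocations that cudaDeviceSetLimit controls,
    // then see how much device memory is gone once the context exists.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // These limits default to non-zero values per context; forcing them
        // down is the workaround described above.
        cudaDeviceSetLimit(cudaLimitStackSize, 0);
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 0);
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 0);

        // cudaMemGetInfo forces context creation, so "free" here already
        // reflects this process's context overhead (plus whatever else is
        // currently resident on the GPU).
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);

        printf("total %zu MB, free %zu MB, unavailable %zu MB\n",
               totalBytes >> 20, freeBytes >> 20,
               (totalBytes - freeBytes) >> 20);
        return 0;
    }

Running a few instances of this at the same time should make the per-process drop visible in the free figure.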

I’ve seen hints of other people reporting this kind of problem, but I haven’t seen any definitive results. Is this a bug? Is this limitation documented? What is all that memory being used for?

The context overhead, most likely. When a cubin gets loaded, the driver preallocates all the memory it might potentially need to run whatever is in the cubin file: up to 64 KB per MP on the card for constant memory, about 21 KB per byte of per-thread local memory in each kernel (one byte for every one of the ~21,500 threads that can be resident on a GTX 470 at once), etc. It can add up.
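If it helps to see where the local-memory term comes from, each kernel’s declared per-thread usage can be queried at runtime. The sketch below is a generic example (the kernel is made up, not something from this thread):

    // Inspect how much local and constant memory a kernel declares; the
    // driver's local-memory reservation scales with the per-thread figure
    // times the worst-case resident thread count.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void spillHeavyKernel(float *out)
    {
        // A large, dynamically indexed per-thread array is likely to spill
        // out of registers into local memory.
        float scratch[256];
        for (int i = 0; i < 256; ++i)
            scratch[i] = i * 0.5f;
        out[threadIdx.x] = scratch[threadIdx.x % 256];
    }

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, spillHeavyKernel);

        // localSizeBytes is per thread; multiply by the maximum resident
        // thread count (roughly 21,504 on a GTX 470) to estimate the
        // driver's reservation for this kernel.
        printf("local  %zu bytes/thread\n", attr.localSizeBytes);
        printf("const  %zu bytes\n", attr.constSizeBytes);
        printf("shared %zu bytes/block\n", attr.sharedSizeBytes);
        return 0;
    }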

Part of this is a bug that will be fixed in the next 4.0 driver.

Just a random question, but do __launch_bounds__ qualifiers have any influence on the footprint of a given module when loaded? If they don’t, could they? It might be one mechanism for giving the programmer some control over “involuntary” memory usage.
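For anyone unfamiliar with the qualifier in question, it looks like this (a generic example, not code from this project). The compiler already uses the bounds for register allocation; whether the loader could also use them to shrink a module’s preallocated footprint is exactly the open question:

    // __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    // caps the block size the kernel will ever be launched with and hints
    // the desired residency, which guides the compiler's register usage.
    #define MAX_THREADS_PER_BLOCK 256
    #define MIN_BLOCKS_PER_MP     4

    __global__ void
    __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
    boundedKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }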