I am working on a project where several processes each want to use the same GPU under CUDA 4.0. I’ve found that a large chunk of device memory is consumed (per-context overhead?) by every new process. By default, I lose about 202.75MB of device memory per process. I can play with cudaDeviceSetLimit() to force the stack, heap, and printf FIFO sizes to 0, which reduces the overhead to ~181.6MB, but of course that is an impractical “solution.” If only one process were using the GPU this would be a somewhat acceptable loss, but as it stands I can only create a handful of processes before my GTX 470 runs out of memory, before I’ve even launched a kernel!
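For reference, here is roughly how I’m measuring the overhead in each process. Treat it as a sketch: the 0-byte limits are just what I’ve been experimenting with, and I use the usual cudaFree(0) trick to force context creation before querying free memory.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Shrink the per-context allocations as far as they will go.
    // (Setting these to 0 is only for measurement; real code needs
    // sensible values here.)
    cudaDeviceSetLimit(cudaLimitStackSize,      0);
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 0);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 0);

    // cudaFree(0) forces the context to be created without
    // allocating anything of our own.
    cudaFree(0);

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("free: %zu MB of %zu MB\n", freeMem >> 20, totalMem >> 20);
    return 0;
}
```

Running several instances of this at once and comparing the reported free memory is how I arrive at the ~181.6MB-per-process figure above (and ~202.75MB without the cudaDeviceSetLimit() calls).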
I’ve seen hints of other people reporting this kind of problem, but I haven’t seen any definitive answers. Is this a bug? Is this limitation documented anywhere? What is all that memory actually being used for?