Global memory usage on a CUDA device

I have a multi-threaded application that returns an out-of-memory error when performing a cudaMalloc. The error is reported only when the number of threads is greater than 9; with fewer than 9 threads, the application runs correctly without any error. Using cudaMemGetInfo to check the application's memory usage, I found that the entire application uses almost 2.8GB of global memory (on a Tesla C2050), but my algorithm only allocates 100MB (each thread uses around 10MB). What is using the remaining 2.7GB?
I know that each host thread has its own CUDA context; in my case that would mean each context uses almost 300MB. Is that possible?
Is there a simple and reliable way to compute the size of a CUDA context?
Does context size also depend on hardware and OS?
Can I find somewhere (programming guide, reference manual, …) more information about CUDA context size and global memory usage?
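For reference, this is a minimal sketch of how I measure the usage from each host thread (the calculation is just total minus free, as reported by cudaMemGetInfo; the `cudaFree(0)` call is only there to force context creation before measuring):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force context creation on the current device before measuring;
    // otherwise cudaMemGetInfo would trigger it and skew the numbers.
    cudaFree(0);

    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("used: %.1f MB of %.1f MB\n",
           (totalBytes - freeBytes) / (1024.0 * 1024.0),
           totalBytes / (1024.0 * 1024.0));
    return 0;
}
```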


On Fermi, the context can be larger because on-device heap space is reserved for printf support and in-kernel malloc. 300MB sounds a little on the large side, but context resource requirements are certainly not trivial. You can query the limits that drive those reservations for the calling thread's context with cudaThreadGetLimit, and you might be able to reduce the per-context memory usage a bit with cudaThreadSetLimit.
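Something along these lines will print the current reservations and shrink the printf and malloc heaps if your kernels don't need them (the new sizes below are illustrative; how much device memory this actually saves depends on the hardware and driver):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t printfFifo = 0, mallocHeap = 0, stackSize = 0;
    cudaThreadGetLimit(&printfFifo, cudaLimitPrintfFifoSize);
    cudaThreadGetLimit(&mallocHeap, cudaLimitMallocHeapSize);
    cudaThreadGetLimit(&stackSize,  cudaLimitStackSize);
    printf("printf FIFO: %zu bytes, malloc heap: %zu bytes, stack: %zu bytes\n",
           printfFifo, mallocHeap, stackSize);

    // If the kernels never call printf or in-kernel malloc,
    // the default reservations can be reduced.
    cudaThreadSetLimit(cudaLimitPrintfFifoSize, 1 << 16);  // 64 KB
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 1 << 20);  // 1 MB
    return 0;
}
```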

I personally don’t like the idea of running multiple contexts on a single device, especially from the same application. Apart from the resource footprint of each context, context switching isn’t all that fast. I would recommend having a single worker thread hold one context for the life of the application and having the other threads pass work to it. If you use streams, you can potentially get kernel concurrency and overlap data transfers with computation, with only a single context’s resource overhead and no context-switching penalty.
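To illustrate the streams part, here is a sketch of the pattern in the worker thread (the `process` kernel, chunk sizes, and stream count are hypothetical stand-ins for your real per-thread work; note that async copies require pinned host memory):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-thread workload.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int nStreams = 4, n = 1 << 20;
    const size_t chunkBytes = n * sizeof(float);

    float *hBuf, *dBuf;
    cudaMallocHost(&hBuf, nStreams * chunkBytes);  // pinned, needed for async copies
    cudaMalloc(&dBuf, nStreams * chunkBytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk's copy-in, kernel launch, and copy-out are queued on its
    // own stream, so the transfer for one chunk can overlap the compute
    // for another -- all within a single context.
    for (int s = 0; s < nStreams; ++s) {
        float *h = hBuf + s * n, *d = dBuf + s * n;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        process<<<(n + 255) / 256, 256, 0, streams[s]>>>(d, n);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(hBuf);
    cudaFree(dBuf);
    return 0;
}
```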