I have a multi-threaded application that gets an out-of-memory error from cudaMalloc. The error is reported only when the number of threads is greater than 9; with fewer than 9 threads the application runs correctly without any error. Using cudaMemGetInfo to check memory usage, I found that the application as a whole uses almost 2.8 GB of global memory (on a Tesla C2050), but my algorithm only allocates about 100 MB (each thread uses around 10 MB). What is using the remaining 2.7 GB?
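For reference, here is a minimal sketch of what each worker thread does (simplified, with illustrative names; the real code allocates roughly 10 MB per thread):

```cpp
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Each worker allocates ~10 MB, prints the device-wide memory figures,
// then frees its buffer.
static void worker(int id) {
    void* buf = nullptr;
    const size_t bytes = 10u * 1024u * 1024u;   // ~10 MB per thread
    cudaError_t err = cudaMalloc(&buf, bytes);  // this is the call that fails
    if (err != cudaSuccess) {
        std::printf("thread %d: cudaMalloc failed: %s\n",
                    id, cudaGetErrorString(err));
        return;
    }
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);            // device-wide, not per-thread
    std::printf("thread %d: free %zu MB / total %zu MB\n",
                id, freeB >> 20, totalB >> 20);
    cudaFree(buf);
}

int main(int argc, char** argv) {
    const int n = (argc > 1) ? std::atoi(argv[1]) : 10;  // thread count
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) threads.emplace_back(worker, i);
    for (auto& t : threads) t.join();
    return 0;
}
```

With 9 or fewer threads all allocations succeed; above that, some cudaMalloc calls return out of memory even though the threads together only request ~100 MB.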
I know that each host thread gets its own CUDA context; in my case that would mean each context uses almost 300 MB. Is that possible?
Is there a simple and reliable way to compute the size of a CUDA context?
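For now I estimate it empirically: in a fresh process I force context creation with cudaFree(0) and then read cudaMemGetInfo, taking total minus free as an upper bound on the context footprint (a rough sketch; it assumes nothing else, e.g. a display, is holding memory on the device):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // cudaFree(0) forces the runtime to create this thread's context
    // without allocating anything ourselves.
    cudaFree(0);

    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    // Before any cudaMalloc, total - free approximates the context
    // footprint plus whatever other processes hold on the device.
    std::printf("context overhead ~%zu MB (total %zu MB, free %zu MB)\n",
                (totalB - freeB) >> 20, totalB >> 20, freeB >> 20);
    return 0;
}
```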
Does context size also depend on hardware and OS?
Is there somewhere (programming guide, reference manual, …) I can find more information about CUDA context size and global memory usage?