Questions about multiple CPU threads on a single device: multiple contexts?

Hello all,

I’m a newbie CUDA user.
I tried to rewrite some functions in my program to speed it up.
At first, I used the runtime API to do this, and the modified program worked well with a single CPU thread.
Since the program is designed to run with multiple CPU threads, I tried to use the driver API to share the data on the device (by pushing and popping contexts).

Currently, I implemented it in the following way:

  1. One thread “A” creates n CUDA contexts and allocates memory for each context.
  2. All n CPU threads are invoked simultaneously. Each CPU thread pushes its corresponding context and starts working (invoking kernels and transferring data between host and device).
  3. Repeat step 2 until all data are processed. Then thread “A” releases the memory and destroys the contexts created in step 1.
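In case it helps to be concrete, here is a minimal sketch of the workflow the steps above describe, assuming the pre-CUDA-4.0 driver API where a context can be current to only one CPU thread at a time. The buffer size, thread count, and the `worker` function body are placeholders, not my real code; error handling is reduced to one macro.

```c
/* Sketch: one context per worker thread, created by a setup thread and
 * handed to workers via push/pop.  Names and sizes are illustrative. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { fprintf(stderr, "%s failed: %d\n", #call, (int)r); exit(1); } } while (0)

#define N_THREADS 4

static CUcontext ctxs[N_THREADS];
static CUdeviceptr bufs[N_THREADS];

/* Step 1: thread "A" creates one context (and one buffer) per worker. */
void setup(void)
{
    CUdevice dev;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    for (int i = 0; i < N_THREADS; ++i) {
        CHECK(cuCtxCreate(&ctxs[i], 0, dev));  /* context becomes current here */
        CHECK(cuMemAlloc(&bufs[i], 1 << 20));  /* ~1 MB per context */
        CHECK(cuCtxPopCurrent(NULL));          /* detach so a worker can push it */
    }
}

/* Step 2: each worker thread pushes "its" context, works, then pops it. */
void worker(int i)
{
    CHECK(cuCtxPushCurrent(ctxs[i]));
    /* ... cuMemcpyHtoD / kernel launches / cuMemcpyDtoH on bufs[i] ... */
    CHECK(cuCtxPopCurrent(NULL));
}

/* Step 3: thread "A" frees the memory and destroys each context. */
void teardown(void)
{
    for (int i = 0; i < N_THREADS; ++i) {
        CHECK(cuCtxPushCurrent(ctxs[i]));
        CHECK(cuMemFree(bufs[i]));
        CHECK(cuCtxPopCurrent(NULL));
        CHECK(cuCtxDestroy(ctxs[i]));
    }
}
```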

My questions are:

Q1: I have seen suggestions that using multiple contexts on a single GPU is not good. But is it still acceptable to create n CUDA contexts when n CPU threads work simultaneously, given that one context cannot be shared by multiple CPU threads at the same time?

Q2: The aforementioned method seems to work well if n is not too big.
But on my graphics card (a 9800GT with 512 MB), memory runs out when n > 13.
I did a simple experiment using the API cuMemGetInfo, and the free memory decreased by 40-50 MB each time I created one context (without allocating any other memory).
So even though I allocate less than 1 MB per context, the device still runs out of memory. Would anyone please give me some instructions/suggestions to solve this problem?
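For reference, this is roughly how I measure the per-context cost (a simplified sketch, not the exact program). Since cuMemGetInfo needs a current context, the "before" reading is taken from inside a first throwaway context; `size_t` is the modern cuMemGetInfo signature, older toolkits used `unsigned int`.

```c
/* Measure the free-memory drop caused by creating one empty context. */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext probe, extra;
    size_t free0, free1, total;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    cuCtxCreate(&probe, 0, dev);   /* first context, used for the queries */
    cuMemGetInfo(&free0, &total);

    cuCtxCreate(&extra, 0, dev);   /* one more context, nothing allocated */
    cuCtxPopCurrent(NULL);         /* pop it so probe is current again */
    cuMemGetInfo(&free1, &total);

    printf("per-context overhead: %lu KB\n",
           (unsigned long)((free0 - free1) / 1024));

    cuCtxDestroy(extra);
    cuCtxDestroy(probe);
    return 0;
}
```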

I'm very sorry if my description is unclear due to my poor English, and thank you very much for reading and replying.

Would anyone please share some suggestions or experiences?

I repeated the memory-cost experiment on another PC, and there the free memory decreased by about 33 MB each time I created one context (without allocating any other memory). I want to know what causes this per-context memory cost, and whether there is any way to reduce it.

Moreover, I have seen that there is a limit on the number of contexts per card (e.g., 16 on Windows). Is that still the case? Would the best approach be to modify the program to use fewer CPU threads, so that fewer contexts are created?

Thank you very much for reading and replying.