I’m a newbie CUDA user.
I tried to rewrite some functions in my program to speed it up.
At first, I used the runtime API to do this, and the modified program worked well with a single CPU thread.
Since the program is designed to run with multiple CPU threads, I tried to use the driver API so the data on the device can be shared between them (by pushing and popping contexts).
Currently, I have implemented it in the following way:
1. One thread “A” creates n CUDA contexts and allocates device memory for each context.
2. All n CPU threads are invoked simultaneously. Each CPU thread pushes its corresponding context and starts working, i.e. invoking kernels and transferring data between host and device (see the sketch after this list).
3. Step 2 is repeated until all data are processed. Then thread “A” releases the memory and destroys the contexts created in step 1.
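To make step 2 concrete, each worker thread does roughly the following (a simplified sketch with error checking omitted; workerThread, devBuf and hostData are just placeholder names, and the context and device buffer are assumed to have already been created by thread “A” and popped there so the context is “floating”):

```c
#include <cuda.h>

void workerThread(CUcontext ctx, CUdeviceptr devBuf,
                  void *hostData, size_t bytes)
{
    cuCtxPushCurrent(ctx);                    /* bind the context to this CPU thread  */

    cuMemcpyHtoD(devBuf, hostData, bytes);    /* host -> device                        */
    /* ... set up and launch the kernel, e.g. with cuFuncSetBlockShape,
     *     cuParamSet* and cuLaunchGrid in the driver API ...                          */
    cuMemcpyDtoH(hostData, devBuf, bytes);    /* device -> host                        */

    cuCtxPopCurrent(NULL);                    /* detach, so the context can be pushed
                                                 again in the next round               */
}
```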
My questions are:
Q1: I have seen suggestions that using multiple contexts on a single GPU is not a good idea. But is it still acceptable to create n CUDA contexts when n CPU threads work simultaneously and a single context cannot be current in more than one CPU thread at a time?
Q2: The method above seems to work well as long as n is not too big, but on my graphics card (a 9800GT with 512 MB) the device memory runs out when n > 13.
I did a simple experiment using cuMemGetInfo, and the free memory decreased by 40-50 MB every time I created a context, without allocating any other memory (the test is sketched below).
So even though I allocate less than 1 MB per context, the device memory still runs out. Could anyone please give me some instructions/suggestions for solving this problem?
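Roughly, the test I ran looked like this (simplified; device 0 and the loop bound of 16 are just what I happened to use, error handling is mostly omitted, and I am assuming the size_t variant of cuMemGetInfo):

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice  dev;
    CUcontext ctx[16];
    size_t    freeMem, totalMem;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    for (int i = 0; i < 16; ++i) {
        if (cuCtxCreate(&ctx[i], 0, dev) != CUDA_SUCCESS) {
            printf("context %d: creation failed (out of memory?)\n", i);
            break;
        }
        /* query free memory while the new context is still current */
        cuMemGetInfo(&freeMem, &totalMem);
        printf("context %d: %zu MB free of %zu MB\n",
               i, freeMem >> 20, totalMem >> 20);

        cuCtxPopCurrent(NULL);   /* leave the context floating for the worker threads */
    }
    return 0;
}
```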
Very sorry if my descriptions are not clear due to my poor English, and thank you very much for reading and replying.