cudaMalloc, cudaFree from different threads


I’m running into some odd (and incorrect) behavior from cudaFree() when it is called from a different thread than the one that called cudaMalloc().

I’m using an 8800 GTX on a win32 machine, and my CUDA code is compiled into a DLL which I call into from java. When I create certain java objects, that causes a call into CUDA that allocates memory via cudaMalloc(). When the java garbage collector cleans the object up, it makes a CUDA call that frees the memory.

My allocation and freeing routines work fine when I call them one after the other in the same thread. When using cudaMalloc() from one thread and cudaFree() from the garbage collector (which runs in a different thread), the free does not appear to be effective (that is, I soon run out of memory as if I hadn’t freed anything), and I get a segfault when the program completes execution (perhaps when java unloads the DLL, but I’m not sure).

I don’t see this problem with cudaMallocArray() and cudaFreeArray().

Any ideas?



On further inspection, it appears that the two threads have two different GPU address spaces (I must have missed that important fact in the programming guide). That explains my problem completely.

Does anyone have a good solution for sharing CUDA memory references between CPU threads? I can solve it with an extra layer of indirection (e.g., all java threads send messages to single CUDA thread which manages memory), but besides being far more complex, there will be a performance hit.


Correct, this should not work, as documented. See this thread:

Note that we do have an internal feature request for migration of CUDA contexts between threads that could help with this, but it is not currently supported and we haven’t yet scheduled the feature.


I see…this is a very serious limitation. For example, I cannot build an efficient library with the following functions:

int handle = createCudaThing(…)
doOperationOnCudaThing(int handle, …)
readCudaThing(int handle, …)
freeCudaThing(int handle)

My intention is that many operations would be performed on the GPU before the Thing needs to be read back into host memory. It is unreasonable to ask a user to do everything in a single thread, especially since it does not match the standard interpretation of a thread (that is, threads share an address space), and it might be difficult to do (as in the Java garbage collector example).

The two options I can think of are:

  1. Keep the state of my CudaThing in host memory and copy it to the GPU and back for every operation

  2. Spawn a CUDA service thread and pass messages back and forth for every operation, so all CUDA operations come from the service thread

Neither of these options is going to be efficient. I may be better off using a different technology entirely.
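For reference, option 2 could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a tested CUDA implementation: `ServiceThread`, `call`, `g_gpu`, `g_things` and the `n` parameter are hypothetical names, and a host-side `std::vector` stands in for device memory so the sketch runs without a GPU. In a real build the marked lines would call cudaMalloc() and cudaFree(), which would then always execute on the one thread that owns the CUDA context.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Runs every submitted task on one dedicated thread (hypothetical helper).
class ServiceThread {
public:
    ServiceThread() : worker_([this] { run(); }) {}
    ~ServiceThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        worker_.join();
    }

    // Queue a task and block the caller until the service thread has run it.
    template <typename F>
    auto call(F f) -> decltype(f()) {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.emplace([task] { (*task)(); });
        }
        ready_.notify_one();
        return result.get();
    }

    std::thread::id id() const { return worker_.get_id(); }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested and queue drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

static ServiceThread g_gpu;                         // the single "CUDA" thread
static std::map<int, std::vector<float>> g_things;  // touched only on g_gpu's thread
static int g_nextHandle = 1;

// Callable from any thread; the allocation itself runs on the service thread.
int createCudaThing(std::size_t n) {
    return g_gpu.call([n] {
        assert(std::this_thread::get_id() == g_gpu.id());
        int handle = g_nextHandle++;
        g_things[handle].assign(n, 0.0f);  // real code: cudaMalloc(&ptr, n * sizeof(float))
        return handle;
    });
}

// Callable from any thread; returns false if the handle is unknown.
bool freeCudaThing(int handle) {
    return g_gpu.call([handle] {
        assert(std::this_thread::get_id() == g_gpu.id());
        return g_things.erase(handle) == 1;  // real code: cudaFree(ptr)
    });
}
```

With something like this, a garbage collector thread could call freeCudaThing(handle) directly; the call blocks until the service thread has processed it, which is where the extra synchronization cost comes from.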

I guess the purpose of my rant here is to ask two questions: Does anyone have a better solution? If not, can we get a shared or migratable context feature scheduled? For my applications at least, it’s a critical capability.


bshucker, I agree 100% with what you’re saying. If you’re calling several GPU threads, why on earth would you want to return to the CPU to do stuff and then pass all the data back to the GPU to run another CUDA function?

Seems dumb to me.


Brian, what is the performance hit you are talking about? There is a one-to-one correspondence between CPU threads and CUDA contexts. So you should either explicitly do everything in one ‘service’ thread or move the context between ‘requester’ threads (not exposed in the CUDA runtime yet). In both cases there is a certain amount of synchronization overhead (which depends on the synchronization mechanisms being used, but is probably the same for both cases?), plus context migration overhead in the second case. But in the first case no additional copies to the service thread need to be made, since CPU threads within the same process share the host address space.

I think you are correct: the overhead comes in the form of synchronization, message passing, and context switching. It’s not as serious as having to copy the memory, but it’s still worse than being able to share the GPU context. A couple of inter-thread messages and context switches per GPU operation can add up.
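One way to put a number on that overhead is to time an empty request/reply handoff between two CPU threads. The sketch below is only an assumption-laden illustration: `measureRoundTrips` and `RoundTripResult` are hypothetical names, no CUDA call is made, and the result will vary with the OS scheduler and the synchronization primitive used. It measures only the thread handoff, not any GPU work.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

struct RoundTripResult {
    int handled;       // number of requests the helper thread processed
    double averageNs;  // mean wall-clock time per request/reply handoff
};

// Posts `rounds` empty requests to a helper thread, waiting for each reply.
RoundTripResult measureRoundTrips(int rounds) {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;  // requests posted but not yet handled
    int handled = 0;  // requests the helper has completed
    bool done = false;

    std::thread service([&] {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return pending > 0 || done; });
            if (pending == 0) return;  // done was set and nothing is queued
            --pending;
            ++handled;  // a real service thread would issue the CUDA call here
            cv.notify_all();
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        std::unique_lock<std::mutex> lock(m);
        ++pending;
        cv.notify_all();
        cv.wait(lock, [&] { return handled == i + 1; });
    }
    auto stop = std::chrono::steady_clock::now();

    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    service.join();

    double totalNs = std::chrono::duration<double, std::nano>(stop - start).count();
    return RoundTripResult{handled, rounds > 0 ? totalNs / rounds : 0.0};
}
```

Comparing the reported average against the duration of a typical kernel launch or memory operation in the application would indicate whether the service-thread design is actually a bottleneck.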