cudaMalloc, cudaFree from different threads

bshucker · August 22, 2007, 4:24pm

Hi,

I’m running into some odd (and incorrect) behavior from cudaFree() when it is called from a different thread than cudaMalloc().

I’m using an 8800 GTX on a Win32 machine, and my CUDA code is compiled into a DLL which I call into from Java. When I create certain Java objects, that causes a call into CUDA that allocates memory via cudaMalloc(). When the Java garbage collector cleans the object up, it makes a CUDA call that frees the memory.

My allocation and freeing routines work fine when I call them one after the other in the same thread. When I call cudaMalloc() from one thread and cudaFree() from the garbage collector (which runs in a different thread), the free does not appear to take effect (that is, I soon run out of memory as if I hadn’t freed anything), and I get a segfault when the program completes execution (perhaps when Java unloads the DLL, but I’m not sure).

I don’t see this problem with cudaMallocArray() and cudaFreeArray().

Any ideas?

Thanks,

Brian

bshucker · August 22, 2007, 7:51pm

On further inspection, it appears that the two threads have two different address spaces (I must have missed that important fact in the programming guide). That explains my problem completely.

Does anyone have a good solution for sharing CUDA memory references between CPU threads? I can solve it with an extra layer of indirection (e.g., all Java threads send messages to a single CUDA thread which manages memory), but besides being far more complex, there will be a performance hit.

Brian

Mark_Harris · August 23, 2007, 9:49am

Correct, this should not work, as documented. See this thread:
http://forums.nvidia.com/index.php?showtopic=43545

Note that we do have an internal feature request for migration of CUDA contexts between threads that could help with this, but it is not currently supported and we haven’t yet scheduled the feature.

Mark

bshucker · August 23, 2007, 1:12pm

I see…this is a very serious limitation. For example, I cannot build an efficient library with the following functions:

int handle = createCudaThing(…)
doOperationOnCudaThing(int handle, …)
readCudaThing(int handle, …)
freeCudaThing(int handle)

My intention is that many operations would be performed on the GPU before the Thing needs to be read back into host memory. It is unreasonable to ask a user to do everything in a single thread, especially since it does not match the standard interpretation of a thread (that is, threads share an address space), and it might be difficult to do (as in the Java garbage collector example).

The two options I can think of are:

1. Keep the state of my CudaThing in host memory and copy it to the GPU and back for every operation.
2. Spawn a CUDA service thread and pass messages back and forth for every operation, so all CUDA operations come from the service thread.

Neither of these options is going to be efficient. I may be better off using a different technology entirely.

I guess the purpose of my rant here is to ask two questions: Does anyone have a better solution? If not, can we get a shared or migratable context feature scheduled? For my applications at least, it’s a critical capability.

Brian

cmorrison · August 24, 2007, 11:48am

bshucker, I agree 100% with what you’re saying. If you’re calling several GPU threads, why on earth would you want to return to the CPU to do stuff and then pass all the data back to the GPU to run another CUDA function?

Seems dumb to me,

Chris

vpodlozhnyuk · August 25, 2007, 6:56am

Brian, what is the performance hit you are talking about? There is a one-to-one correspondence between CPU threads and CUDA contexts. So you should either explicitly do everything in one ‘service’ thread or move the context between ‘requester’ threads (not exposed in the CUDA runtime yet). In both cases there is a certain amount of synchronization overhead (which depends on the synchronization mechanisms being used, but is probably the same for both cases?), plus context migration overhead in the second case. But in the first case no additional copies to the service thread need to be made, since CPU threads within the same process share the host address space.
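
To make the ‘service’ thread option concrete, here is a minimal C++ sketch. It is an illustration under stated assumptions, not code posted by anyone in this thread. `ServiceThread` is a hypothetical helper that runs every submitted task on one worker thread, so all allocation and free calls are issued from the same CPU thread and therefore the same CUDA context. `deviceAlloc` and `deviceFree` are stand-ins built on `malloc`/`free` so the sketch builds without the CUDA toolkit; in a real library they would presumably wrap `cudaMalloc` and `cudaFree`.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one worker thread executes every submitted task.
class ServiceThread {
public:
    ServiceThread() : worker_([this] { run(); }) {}

    ~ServiceThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Queue fn and block the caller until the service thread has run it.
    template <typename F>
    auto call(F fn) -> decltype(fn()) {
        using R = decltype(fn());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(fn));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result.get();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // run outside the lock so callers can keep queuing
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last: it starts using the members above
};

// Stand-ins (assumption): a real library would call cudaMalloc/cudaFree here.
void* deviceAlloc(std::size_t bytes) { return std::malloc(bytes); }
void deviceFree(void* p) { std::free(p); }

// Allocate on behalf of the calling thread, then free on behalf of a second
// thread (think: a garbage collector). Returns true if both calls actually
// executed on the service thread rather than on the requesting threads.
bool allocAndFreeFromDifferentThreads() {
    ServiceThread service;
    std::thread::id allocThread, freeThread;

    void* p = service.call([&] {
        allocThread = std::this_thread::get_id();
        return deviceAlloc(1024);
    });

    std::thread collector([&] {
        service.call([&] {
            freeThread = std::this_thread::get_id();
            deviceFree(p);
        });
    });
    collector.join();

    return p != nullptr && allocThread == freeThread &&
           allocThread != std::this_thread::get_id();
}
```

Each `call` costs one queue push, one wake-up, and one blocking wait on the requester side. That is the per-operation synchronization overhead in question; no host data is copied, because the requester and service threads share the host address space.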

bshucker · August 27, 2007, 1:28pm

I think you are correct: the overhead comes in the form of synchronization, message passing, and context switching. It’s not as serious as having to copy the memory, but it’s still worse than being able to share the GPU context. A couple of inter-thread messages and context switches per GPU operation can add up.

Brian
