I’m new to CUDA, and when comparing the run time of my CUDA software against the old plain C version, I did not see much improvement (it was actually slower). At first I thought this was because the matrix operations I do only involve relatively small matrices/vectors, but I just found out that the operation taking most of the time is the deallocation of the memory!
It can take up to 10 seconds to free an array of 9,000,000 floats. However, it only does this for the first array: if I allocate several of them (allocation is fast, nearly instantaneous) and then free them, the first one takes forever to free and the others are lightning fast!
I allocate the memory using cudaMalloc((void**)&(storage->data), size * sizeof(float)) and free it using cudaFree(storage->data).
What am I doing wrong?
Thanks in advance for any advice/insight on what may cause this and/or how to solve it!
It is possible that your timing is wrong, so that what you think is time spent in free() is actually kernel execution time. In the CUDA runtime API, malloc/free/copy are blocking operations, but kernel launches (and things like CUBLAS function calls) are non-blocking. For timing correctness, call cudaThreadSynchronize() after each kernel launch; the host thread will then block until the kernel has finished running. You might find it changes the look of your code's timing/profiling results.
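For illustration, here is a minimal sketch of that timing pattern (the kernel name scale_kernel, the launch configuration, and the array size are made up for the example). Without the cudaThreadSynchronize() call, the measured time would be close to zero and the pending work would get charged to the next blocking call, such as cudaFree():

[code]
// Minimal host-side timing sketch. The kernel launch returns immediately,
// so the timer only measures real GPU work if we synchronize first.
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// hypothetical kernel, for illustration only
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 9000000;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    clock_t start = clock();
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaThreadSynchronize();   // wait until the kernel has actually finished
    clock_t stop = clock();
    printf("kernel time: %f s\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Without the synchronize above, cudaFree() would appear to take the
    // kernel's execution time, because it is the first blocking call.
    cudaFree(d_data);
    return 0;
}
[/code]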
Does this mean that each time I call a kernel, I should call cudaThreadSynchronize() afterwards to make sure that all threads are done? I did not see that in either the programming guide or the SDK samples. Now that I look again, nearly all my kernel calls are followed by a copy, so maybe that acts as the barrier?
The standard copy functions will act as a barrier, yes (there are also non-blocking asynchronous versions of the copies which can be used in some circumstances). cudaThreadSynchronize() isn’t necessary unless you are interested in timing kernels with host-side timers. The asynchronous nature of kernel launches can often be exploited to let the host CPU do something else while the GPU is busy.
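As a rough sketch of that pattern (again using the made-up scale_kernel; the comment marks where unrelated CPU work could go), the blocking cudaMemcpy at the end is the point where the host actually waits for the GPU:

[code]
// Sketch: overlap host work with an asynchronous kernel launch; the
// blocking cudaMemcpy acts as the synchronization point.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// placeholder kernel, for illustration only
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 9000000;
    float *h_result = (float*)malloc(n * sizeof(float));
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // the launch returns immediately; the GPU works in the background
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // ... the CPU is free to do other useful work here while the kernel runs ...

    // cudaMemcpy blocks until the kernel has finished and the copy is done,
    // so no explicit cudaThreadSynchronize() is needed before it
    cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("first element: %f\n", h_result[0]);
    cudaFree(d_data);
    free(h_result);
    return 0;
}
[/code]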