I’m new to CUDA, and when comparing the run time of my CUDA software against the old plain C version, I did not see much improvement (it was actually slower). At first I thought this was because the matrix operations I do only involve relatively small matrices/vectors, but I just found out that the operation taking most of the time is the deallocation of the memory!
It can take up to 10 seconds to free an array of 9,000,000 floats. However, it only does this for the first array: if I allocate several of them (allocation is fast, nearly instantaneous) and then free them, the first one takes forever to free and the others are lightning fast!
I allocate the memory using cudaMalloc((void**)&(storage->data), size * sizeof(float)) and free it using cudaFree(storage->data).
What am I doing wrong?
Thanks in advance for any advice/insight on what may cause this and/or how to solve it!
It is possible that your timing is wrong, so that what you think is time spent in free() is actually kernel execution time. In the CUDA runtime API, malloc/free/copy are blocking operations, but kernel launches (and things like CUBLAS function calls) are non-blocking. For timing correctness, call cudaThreadSynchronize() after each kernel launch; the host thread will then block until the kernel has finished running. You might find it changes the look of your code's timing/profiling results.
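For illustration, here is a minimal sketch of that timing pattern (the kernel name scale_kernel, the launch configuration, and the array size are made up for the example). Without the cudaThreadSynchronize() call, the measured time would be close to zero and the pending work would get charged to the next blocking call, such as cudaFree():

[code]
// Minimal host-side timing sketch. The kernel launch returns immediately,
// so the timer only measures real GPU work if we synchronize first.
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// hypothetical kernel, for illustration only
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 9000000;
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    clock_t start = clock();
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaThreadSynchronize();   // wait until the kernel has actually finished
    clock_t stop = clock();
    printf("kernel time: %f s\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Without the synchronize above, cudaFree() would appear to take the
    // kernel's execution time, because it is the first blocking call.
    cudaFree(d_data);
    return 0;
}
[/code]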
Does this mean that each time I call a kernel, I should call cudaThreadSynchronize() afterwards to make sure that all threads are done? I did not see that in either the programming guide or the SDK samples. Now that I look again, nearly all my kernel calls are followed by a copy, so maybe that acts as the barrier?
The standard copy functions will act as a barrier, yes (there are also non-blocking asynchronous versions of the copies which can be used in some circumstances). cudaThreadSynchronize() isn’t necessary unless you are interested in timing kernels with host-side timers. The asynchronous nature of kernel launches can often be exploited to let the host CPU do something else while the GPU is busy.
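As a rough sketch of that pattern (again using the made-up scale_kernel; the comment marks where unrelated CPU work could go), the blocking cudaMemcpy at the end is the point where the host actually waits for the GPU:

[code]
// Sketch: overlap host work with an asynchronous kernel launch; the
// blocking cudaMemcpy acts as the synchronization point.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// placeholder kernel, for illustration only
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 9000000;
    float *h_result = (float*)malloc(n * sizeof(float));
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // the launch returns immediately; the GPU works in the background
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // ... the CPU is free to do other useful work here while the kernel runs ...

    // cudaMemcpy blocks until the kernel has finished and the copy is done,
    // so no explicit cudaThreadSynchronize() is needed before it
    cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("first element: %f\n", h_result[0]);
    cudaFree(d_data);
    free(h_result);
    return 0;
}
[/code]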