Latency of freeing device memory

This is a newbie’s question.

I wonder how much time “cudaFree” needs to free device memory.
Is it asynchronous, and does it depend on the size of the allocation?
Recently, I found out that its latency can be a significant factor in the performance of my applications.
In some cases, deallocating device memory takes around 10-100 microseconds, which seems OK.
But sometimes it takes about 1000-2000 microseconds.
This huge latency occurs randomly and seriously degrades the performance of my code.

I measured the time with the cutCreateTimer, cutStartTimer, and cutGetTimerValue functions; whether or not I use threadSync makes no difference.

Does anyone know the expected time for cudaFree?

Thank you very much in advance.


Like I posted in another topic recently: if performance is an issue, always use your own memory pool that you can optimize for your own allocation patterns. Just grab a big block (or several) at the beginning of your program. Never rely on operating system alloc() and free() to be fast, and be sure to never use them in inner loops.
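
To make the pool idea concrete, here is a minimal sketch of the simplest kind of pool, a bump allocator. It is only one possible design, not a definitive implementation: the class name BumpPool, its methods, and the default 256-byte alignment are hypothetical choices for illustration. The bookkeeping is plain host-side C++, so it assumes the base pointer comes from a single cudaMalloc() made at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump-style pool over one big block. It only does offset
// bookkeeping on the host; `base` is assumed to come from one cudaMalloc()
// at program start (the arithmetic itself works for any address).
class BumpPool {
public:
    BumpPool(void* base, std::size_t capacity)
        : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity), used_(0) {}

    // Returns a sub-block of `bytes` whose offset from the base is a multiple
    // of `align` (must be a power of two), or nullptr when the pool is
    // exhausted. No driver call is made here.
    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        used_ = start + bytes;
        return base_ + start;
    }

    // Releases every sub-block at once, e.g. at the end of a loop iteration.
    void reset() { used_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

In an inner loop you would call alloc() for each temporary buffer and reset() once per iteration, so cudaMalloc() and cudaFree() each run only once for the whole program. A bump allocator cannot free individual sub-blocks; if your allocation pattern needs that, a free-list design would be a better fit.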

Oh, that’s a good idea. My CPU program already has that kind of memory management tool, and I will have to do a similar job for GPU memory.

Thanks again,


Just to share my experience: I had a CPU loop (running around 1000 or 2000 times) that made 2 calls to cudaMalloc(). That loop alone used to take “seconds” (like 20 or 40 seconds) to complete. When I did one massive allocation and shared it among the 1000 iterations, I found that it took just a few milliseconds.