I’m running a GPGPU application with CUDA on a GTX 680. The application runs on a single CPU thread and a single CUDA stream, and it has many different kernels and many cudaMalloc/cudaFree calls. I see that some of the cudaMalloc calls take a VERY long time (about 2 seconds). Does anyone know why this happens and how I can avoid it?
I generally avoid this issue by dividing my code into 3 different parts:
- Initialization (setDevice, allocate all necessary buffers)
- Real-time processing (pure kernels and computations, NO ALLOCATIONS)
- DeInit (free buffers etc.)
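A minimal sketch of that structure (the kernel, buffer size, and frame count are hypothetical, just to illustrate where the allocations go):

```cuda
#include <cuda_runtime.h>

#define N (1 << 20)  // hypothetical buffer size

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

int main(void) {
    float *d_in, *d_out;

    // 1. Initialization: pick the device, allocate everything once.
    cudaSetDevice(0);
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    // 2. Real-time processing: kernels only, NO allocations in this loop.
    for (int frame = 0; frame < 100; ++frame) {
        process<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    }
    cudaDeviceSynchronize();

    // 3. DeInit: free the buffers.
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```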
I believe the 1-2 second overhead is due to an initialization of the GPU (“a warmup”) that likely occurs on your first CUDA allocation.
I believe this init overhead can be avoided by setting the device in a previous step, executing some kind of warmup, or running a keep-warm daemon in the background.
I agree that the 3 parts you mention above are the best way to go, but I can’t work in this manner because my application is a sequence of kernels, each one passing its output to the next kernel as input, while involving all kinds of memory objects such as textures and layered surfaces. I do not have enough memory on the GPU to allocate it all in advance. I am also using CUFFT, which has its own memory needs… Also, some of my allocation sizes depend on the actual input.
The 1-2 second overhead I see is probably not initialization (creation of a CUDA context), since it happens several times while the application runs, and not during the first cudaMalloc. I also see in the Nsight timeline that this is not the case (a new CUDA context is not created). I’m attaching an image of the Nsight timeline view.
Couldn’t you allocate a few large buffers up front and reuse the same pointers, while making sure that you are not overwriting any buffer that is still in use? The fact that you can free one buffer and then allocate a new one implies that the freed memory could have been reused instead.
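For example, something like this hypothetical slab allocator: one big cudaMalloc at startup, then hand out aligned sub-ranges instead of calling cudaMalloc per buffer (names and the 256-byte alignment are my own assumptions, not from your code):

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// One big device allocation, carved into sub-buffers with a bump pointer.
static char  *g_slab   = NULL;
static size_t g_offset = 0;
static size_t g_size   = 0;

// Allocate the slab once during initialization.
int slab_init(size_t bytes) {
    g_size   = bytes;
    g_offset = 0;
    return cudaMalloc((void **)&g_slab, bytes) == cudaSuccess ? 0 : -1;
}

// Hand out a 256-byte-aligned sub-range; no driver call involved.
void *slab_alloc(size_t bytes) {
    size_t aligned = (g_offset + 255) & ~(size_t)255;
    if (aligned + bytes > g_size) return NULL;  // slab exhausted
    g_offset = aligned + bytes;
    return g_slab + aligned;
}

// Reuse the whole slab for the next pipeline pass.
void slab_reset(void) { g_offset = 0; }

// Free the slab during deinit.
void slab_destroy(void) { cudaFree(g_slab); g_slab = NULL; }
```

Whether this works for you depends on whether the buffers in flight at any one moment fit in the slab; textures, layered surfaces, and CUFFT workspaces would still need their own allocations.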
Please also attach the mentioned Nsight timeline. :-)
If the longer delay occurs only on the very first cudaMalloc(), it would seem to be initialization overhead for the CUDA context, which happens lazily. If you precede this first cudaMalloc() by a call to cudaFree(0), initialization overhead should switch to the cudaFree() call. If you see longish execution times for random cudaMalloc() calls, consider filing a bug report, attaching a self-contained repro case, because in general cudaMalloc() calls should not take 2 seconds.
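A quick sketch of how one might verify this (the timing helper is my own, assuming a POSIX monotonic clock; cudaFree(0) is a harmless call that still forces lazy context creation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

// Millisecond wall-clock timer.
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    double t0 = now_ms();
    cudaFree(0);              // context creation cost should land here
    double t1 = now_ms();

    void *p = NULL;
    cudaMalloc(&p, 1 << 20);  // now just a plain 1 MB allocation
    double t2 = now_ms();

    printf("cudaFree(0): %.1f ms, first cudaMalloc: %.1f ms\n",
           t1 - t0, t2 - t1);
    cudaFree(p);
    return 0;
}
```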
If you are on Linux and running without X, I suggest you turn on persistence mode with nvidia-smi to prevent the driver from unloading; otherwise the driver is re-loaded on each use, which adds to initialization time.
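For example (requires root; on recent drivers the nvidia-persistenced daemon is the preferred way to achieve the same thing):

```shell
# Enable persistence mode so the driver stays loaded
# even when no CUDA client is attached.
nvidia-smi -pm 1
```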
I’ve updated to CUDA 5.5 and the issue still appears. I noticed that the long cudaMalloc() happens when the device memory allocated is close to the total RAM size of the GTX 680. That is, after allocating 2.9 GB, I execute an additional allocation of 0.8 GB, which leaves only about 300 MB free out of the 4 GB available on the card. The allocation succeeds but takes almost 2 seconds.
(I tried to attach an image of the Nsight timeline showing this, but I get a constant [SCANNING… PLEASE WAIT] status…)
While the additional information indicates that this behavior is observed only in corner cases where CUDA is almost out of allocatable memory, an execution time of 2 seconds still does not seem right and is worthy of closer inspection. I would recommend filing a bug report with a self-contained repro program. Thanks!