I’m running a GPGPU application with CUDA on a GTX 680. The application runs on a single CPU thread and a single CUDA stream, and it has many different kernels and many cudaMalloc/cudaFree calls. I see that some of the cudaMalloc calls take a VERY long time (about 2 seconds). Does anyone know why this happens and how I can avoid it?
I generally avoid this issue by dividing my code into 3 different parts:
- Initialization (setDevice, allocate all necessary buffers)
- Real-time processing (pure kernels and computations, NO ALLOCATIONS)
- DeInit (free buffers etc.)
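A minimal sketch of that structure (the kernel, buffer size, and frame count are hypothetical, just to illustrate where the allocations go):

```cuda
#include <cuda_runtime.h>

#define N (1 << 20)  // hypothetical buffer size

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

int main(void) {
    float *d_in, *d_out;

    // 1. Initialization: pick the device, allocate everything once.
    cudaSetDevice(0);
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    // 2. Real-time processing: kernels only, NO allocations in this loop.
    for (int frame = 0; frame < 100; ++frame) {
        process<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    }
    cudaDeviceSynchronize();

    // 3. DeInit: free the buffers.
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```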
I believe the 1-2 second overhead is due to an initialization of the GPU (“a warmup”) that likely occurs on your first CUDA allocation.
I believe this init overhead can be avoided by setting the device in a previous step, executing some kind of warmup, or running a keep-warm daemon in the background.
I agree that the 3 parts you mention above are the best way to go, but I can’t work in this manner because my application is a sequence of kernels, each one passing its output to the next kernel as input, while involving all kinds of memory objects such as textures and layered surfaces. I do not have enough memory on the GPU to allocate it all in advance. I am also using CUFFT, which has its own memory needs… Also, some of my allocation sizes depend on the actual input.
The 1-2 second overhead I see is probably not initialization (creation of a CUDA context), since it happens several times while the application runs, and not during the first cudaMalloc. I also see in the Nsight timeline that this is not the case (a new CUDA context is not created). I’m attaching an image of the Nsight timeline view.
Couldn’t you allocate a few large buffers up front and reuse the same pointers, while making sure that you are not overwriting any buffer that is still in use? The fact that you can free one buffer and then allocate a new one implies that the freed memory could have been reused instead.
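For example, something like this hypothetical slab allocator: one big cudaMalloc at startup, then hand out aligned sub-ranges instead of calling cudaMalloc per buffer (names and the 256-byte alignment are my own assumptions, not from your code):

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// One big device allocation, carved into sub-buffers with a bump pointer.
static char  *g_slab   = NULL;
static size_t g_offset = 0;
static size_t g_size   = 0;

// Allocate the slab once during initialization.
int slab_init(size_t bytes) {
    g_size   = bytes;
    g_offset = 0;
    return cudaMalloc((void **)&g_slab, bytes) == cudaSuccess ? 0 : -1;
}

// Hand out a 256-byte-aligned sub-range; no driver call involved.
void *slab_alloc(size_t bytes) {
    size_t aligned = (g_offset + 255) & ~(size_t)255;
    if (aligned + bytes > g_size) return NULL;  // slab exhausted
    g_offset = aligned + bytes;
    return g_slab + aligned;
}

// Reuse the whole slab for the next pipeline pass.
void slab_reset(void) { g_offset = 0; }

// Free the slab during deinit.
void slab_destroy(void) { cudaFree(g_slab); g_slab = NULL; }
```

Whether this works for you depends on whether the buffers in flight at any one moment fit in the slab; textures, layered surfaces, and CUFFT workspaces would still need their own allocations.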
Please also attach the mentioned Nsight timeline. :-)
If the longer delay occurs only on the very first cudaMalloc(), it would seem to be initialization overhead for the CUDA context, which happens lazily. If you precede this first cudaMalloc() by a call to cudaFree(0), initialization overhead should switch to the cudaFree() call. If you see longish execution times for random cudaMalloc() calls, consider filing a bug report, attaching a self-contained repro case, because in general cudaMalloc() calls should not take 2 seconds.
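A quick sketch of how one might verify this (the timing helper is my own, assuming a POSIX monotonic clock; cudaFree(0) is a harmless call that still forces lazy context creation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

// Millisecond wall-clock timer.
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    double t0 = now_ms();
    cudaFree(0);              // context creation cost should land here
    double t1 = now_ms();

    void *p = NULL;
    cudaMalloc(&p, 1 << 20);  // now just a plain 1 MB allocation
    double t2 = now_ms();

    printf("cudaFree(0): %.1f ms, first cudaMalloc: %.1f ms\n",
           t1 - t0, t2 - t1);
    cudaFree(p);
    return 0;
}
```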
If you are on Linux and running without X, I suggest you turn on persistence mode with nvidia-smi to prevent the driver from unloading; otherwise the driver is re-loaded on each use, which adds to initialization time.
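For example (requires root; on recent drivers the nvidia-persistenced daemon is the preferred way to achieve the same thing):

```shell
# Enable persistence mode so the driver stays loaded
# even when no CUDA client is attached.
nvidia-smi -pm 1
```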
I’ve updated to CUDA 5.5 and the issue still appears. I noticed that the long cudaMalloc() happens when the device memory allocated is close to the total RAM size of the GTX 680. That is, after allocating 2.9 GB, I execute an additional allocation of 0.8 GB, which leaves only about 300 MB free out of the 4 GB available on the card. The allocation succeeds but takes almost 2 seconds.
(I tried to attach an image of the Nsight timeline showing this, but I get a constant [SCANNING… PLEASE WAIT] status…)
While the additional information indicates that this behavior is observed only in corner cases where CUDA is almost out of allocatable memory, an execution time of 2 seconds still does not seem right and is worthy of closer inspection. I would recommend filing a bug report with a self-contained repro program. Thanks!