cudaFree extremely slow


When I allocate device memory (size = 13 million x sizeof(double)) with cudaMalloc, the call returns instantaneously. However, when I free the memory with cudaFree, that single call takes 46 seconds. Do you know if this timing is normal?



Can you post a minimal self-contained reproducer code, together with software configuration and hardware configuration details?

Hello njuffa,

Oh, sorry! Here is a sample:

__device__ double *Prm_Arr;

__global__ void LAUNCHBOUNDS(MY_KERNEL_MAX_THREADS) Mykernel(...)
  // ...somewhere within this kernel, atomicAdd is used on Prm_Arr

void Main()
  double *Prm_temp_Arr;
  cudaMalloc(&Prm_temp_Arr, PrmSz * sizeof(double));
  // Resetting the parameter values to zero (note: cudaMemset takes an int, not 0.0)
  cudaMemset(Prm_temp_Arr, 0, PrmSz * sizeof(double));
  cudaMemcpyToSymbol(Prm_Arr, &Prm_temp_Arr, sizeof(Prm_temp_Arr));
  // After the kernel has been launched, cudaFree takes around 46 s, and I am
  // using just one stream/GPU.
  // However, if I place this cudaFree before the kernel launch, it returns instantaneously.

How about something self-contained? Hint: Code is self-contained if I can cut, paste, compile, run.

The kernel in question is thousands of lines of code, and I don't think I can post it for intellectual-property reasons. However, if I empty the same kernel function (no code, only comments) and then launch it, cudaFree returns instantaneously. So it seems the issue comes from the interaction between the kernel launch and the memory management of this variable. Do you have any idea what this could be?

Yes. The problem is not in fact with cudaFree(), but somewhere else in your code.

I forgot to mention that I am launching with grid dim3(262143, 2, 50) and block dim3(1, 512, 1).
So, undoubtedly, I am using too many resources. Do you know how I can identify the limits on the launch size?
Note: there is no error reported from the kernel, by the way.

The time it takes to execute cudaFree should have nothing to do with the kernel launch configuration. If you were using too many resources, you would see a different issue.

I don’t see anything wrong with that launch configuration…

Try adding error checking and analyzing your code with cuda-memcheck.
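As a minimal sketch of what such error checking could look like (the CHECK_CUDA macro name is illustrative, not from this thread; assumes the standard CUDA runtime API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative macro: wraps a CUDA runtime call and aborts with a
// readable message if it returns anything other than cudaSuccess.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage sketch: wrap every API call, and query launch errors after
// each kernel launch.
//   CHECK_CUDA(cudaMalloc(&Prm_temp_Arr, PrmSz * sizeof(double)));
//   Mykernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaGetLastError());       // catches launch-configuration errors
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches errors during kernel execution
```

Checking cudaGetLastError() right after the launch and cudaDeviceSynchronize() afterwards distinguishes launch failures from failures that occur while the kernel runs.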

Litter your code with cudaDeviceSynchronize() calls and time the individual segments between the synchronizations.

I believe cudaFree() has a built-in synchronize, so you don't really want to call it while you are waiting for some parallel activity to complete.

I put cudaDeviceSynchronize() right after the kernel launch, and now it seems it is the synchronization that takes too long instead of cudaFree(). Note that I am using only one stream on a single GPU, so I am not running anything else in parallel.

If cudaDeviceSynchronize() takes a long time, that means the code ABOVE the synchronize is what is taking the time:

Mykernel<<<grid, block>>>(...);
cudaDeviceSynchronize(); // 1st synchronize: absorbs the kernel's execution time
somecode();
cudaDeviceSynchronize(); // 2nd synchronize: absorbs the time of work issued by somecode()

If the 2nd synchronize is taking 45 seconds, that means it's actually 'somecode()' causing the problem.

But it was said that when using one stream, the instructions are serialized, so somecode() has to finish before execution passes to the next line, doesn't it?

Operations submitted into a stream will execute in the order they were submitted. The issue of host-device synchronization is orthogonal to this.

Do you think a kernel will run faster if I run it multiple times with a smaller grid/block size?

You seem to be launching 13,421,721,600 threads… is that really the case?
If so, the time is being consumed in the kernel itself, as the previous posters said.

The host thread will NOT block; it asynchronously sends instructions to the GPU.
The GPU will definitely serialize kernel launches within a stream, but imagine the GPU as a second worker thread running in parallel: each host instruction enqueues a job onto that GPU worker thread, something like that.
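As a sketch of that model (the timing code and busy_kernel placeholder are illustrative; assumes a CUDA-capable system), the launch call itself returns almost immediately, while the synchronize waits for the GPU "worker" to drain its queue:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel() { /* placeholder for long-running work */ }

int main() {
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    busy_kernel<<<1024, 256>>>();  // enqueue only: host thread returns immediately
    auto t1 = clk::now();
    cudaDeviceSynchronize();       // host blocks here until the GPU finishes
    auto t2 = clk::now();

    auto us = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    // The launch typically costs microseconds; the synchronize absorbs
    // the full kernel execution time.
    printf("launch: %lld us, sync: %lld us\n",
           (long long)us(t0, t1), (long long)us(t1, t2));
    return 0;
}
```

This is why the 46 seconds showed up first in cudaFree and then moved to cudaDeviceSynchronize(): whichever blocking call comes first after the launch absorbs the kernel's execution time.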