When I allocate device memory (size = 13 million × sizeof(double)) with cudaMalloc, the call returns instantly. However, when I free that memory with cudaFree, this single instruction takes 46 seconds. Do you know if this timing is normal?
__device__ double *Prm_Arr;
__global__ void LAUNCHBOUNDS(MY_KERNEL_MAX_THREADS) Mykernel(...)
{
    // ...somewhere within this kernel, Prm_Arr is updated via atomicAdd
    // (atomicAdd on double requires compute capability 6.0 or higher)
}
int main()
{
    double *Prm_temp_Arr;
    cudaMalloc(&Prm_temp_Arr, PrmSz * sizeof(double));
    // Resetting the parameter values to zero
    // (note: cudaMemset takes an int byte value, so 0 rather than 0.0, and it must come after the cudaMalloc)
    cudaMemset(Prm_temp_Arr, 0, PrmSz * sizeof(double));
    // Point the __device__ pointer Prm_Arr at the allocation so the kernel can use it
    cudaMemcpyToSymbol(Prm_Arr, &Prm_temp_Arr, sizeof(Prm_temp_Arr));
    Mykernel<<<...>>>(...);
    // After the kernel has been launched, this cudaFree takes around 46 s, and I am using just one stream/GPU.
    // However, if I place this cudaFree before the kernel launch, it returns instantly.
    cudaFree(Prm_temp_Arr);
}
The kernel in question is thousands of lines of code, and I don't think I can post it for intellectual-property reasons. However, if I empty that same kernel (no code, only comments) and then launch it, the cudaFree runs instantly. So the issue seems to come from the interaction between the kernel launch and the memory management of this variable. Do you have any idea about this?
I forgot to mention that I am running with Grid: dim3(262143, 2, 50) and Block: dim3(1, 512, 1).
So, undoubtedly, I am using too many resources. Do you know how I can identify the launch limits in terms of size?
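For reference, the hardware launch limits can be queried at runtime with cudaGetDeviceProperties; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

The posted configuration (grid 262143 × 2 × 50, block 1 × 512 × 1) is within these limits on any GPU of compute capability 3.0 or later, so the launch size itself should not be the problem.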
Note: there is no error reported from the kernel output, by the way.
The time it takes to execute cudaFree should have nothing to do with the kernel launch configuration. If you were using too many resources, you would be seeing a different issue.
I don’t see anything wrong with that launch configuration…
Try adding error checking and analyzing your code with cuda-memcheck.
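A minimal sketch of the usual error-checking pattern (the macro name CUDA_CHECK is my own choice):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call so failures are reported with file and line
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   Mykernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors

Then run the binary under cuda-memcheck (for example, cuda-memcheck ./myapp) to surface out-of-bounds or misaligned accesses.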
I put cudaDeviceSynchronize() right after the kernel launch, and it seems that now it is the synchronization that takes too long instead of the cudaFree(). Note that I am using only one stream and a single GPU, so I am not running anything else in parallel.
But it was said that when using one stream the instructions are serialized, so somecode() has to finish before execution passes to the next line, doesn't it?
You seem to be launching 13,421,721,600 threads (262143 × 2 × 50 = 26,214,300 blocks × 512 threads each)… is that really the case?
In that case, the time is being consumed by the kernel itself, as the previous posters said.
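One way to confirm that is to time the kernel directly with CUDA events; a minimal sketch, using an empty stand-in kernel (Dummy, my own name) and the launch configuration quoted above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Dummy() { }  // empty stand-in for the real kernel

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    Dummy<<<dim3(262143, 2, 50), dim3(1, 512, 1)>>>();
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);  // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}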
The host thread will NOT block - it asynchronously sends instructions to the GPU.
The GPU will definitely block between kernel launches - but imagine the GPU worker as a second thread running in parallel: each host instruction launches a job onto that GPU worker thread… something like that.
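A minimal sketch of that behaviour, with a made-up busy-wait kernel (Busy): the launch call itself returns almost immediately, and the first blocking call afterwards (cudaDeviceSynchronize, cudaMemcpy, or cudaFree) absorbs the kernel's execution time, which is why the 46 seconds moved from cudaFree to the synchronize.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel that spins for roughly 10^8 clock cycles (~0.1 s at 1 GHz)
__global__ void Busy()
{
    long long start = clock64();
    while (clock64() - start < 100000000LL) { }
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    Busy<<<1, 1>>>();         // asynchronous: returns almost immediately
    auto t1 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();  // blocks here until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("launch call: %.3f ms, synchronize: %.3f ms\n",
           ms(t1 - t0).count(), ms(t2 - t1).count());
    return 0;
}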