When I allocate device memory (size = 13 million × sizeof(double)) with cudaMalloc, the call returns instantly. However, when I free that memory with cudaFree, this single instruction takes 46 seconds. Do you know if this timing is normal?
__device__ double *Prm_Arr;
__global__ void LAUNCHBOUNDS(MY_KERNEL_MAX_THREADS) Mykernel(...)
{
    // ...somewhere within this kernel, Prm_Arr is updated via atomicAdd
    // (atomicAdd on double requires compute capability 6.0 or higher)
}
int main()
{
    double *Prm_temp_Arr;
    cudaMalloc(&Prm_temp_Arr, PrmSz * sizeof(double));
    // Resetting the parameter values to zero
    // (note: cudaMemset takes an int byte value, so 0 rather than 0.0, and it must come after the cudaMalloc)
    cudaMemset(Prm_temp_Arr, 0, PrmSz * sizeof(double));
    // Point the __device__ pointer Prm_Arr at the allocation so the kernel can use it
    cudaMemcpyToSymbol(Prm_Arr, &Prm_temp_Arr, sizeof(Prm_temp_Arr));
    Mykernel<<<...>>>(...);
    // After the kernel has been launched, this cudaFree takes around 46 s, and I am using just one stream/GPU.
    // However, if I place this cudaFree before the kernel launch, it returns instantly.
    cudaFree(Prm_temp_Arr);
}
The kernel in question is thousands of lines of code, and I don't think I can post it for intellectual-property reasons. However, if I empty that same kernel (no code, only comments) and then launch it, the cudaFree runs instantly. So the issue seems to come from the interaction between the kernel launch and the memory management of this variable. Do you have any idea about this?
I forgot to mention that I am running with Grid: dim3(262143, 2, 50) and Block: dim3(1, 512, 1).
So, undoubtedly, I am using too many resources. Do you know how I can identify the launch limits in terms of size?
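For reference, the hardware launch limits can be queried at runtime with cudaGetDeviceProperties; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

The posted configuration (grid 262143 × 2 × 50, block 1 × 512 × 1) is within these limits on any GPU of compute capability 3.0 or later, so the launch size itself should not be the problem.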
Note: there is no error reported from the kernel output, by the way.
The time it takes to execute cudaFree should have nothing to do with the kernel launch configuration. If you were using too many resources, you would be seeing a different issue.
I don’t see anything wrong with that launch configuration…
Try adding error checking and analyzing your code with cuda-memcheck.
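A minimal sketch of the usual error-checking pattern (the macro name CUDA_CHECK is my own choice):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call so failures are reported with file and line
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   Mykernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors

Then run the binary under cuda-memcheck (for example, cuda-memcheck ./myapp) to surface out-of-bounds or misaligned accesses.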
I put cudaDeviceSynchronize() right after the kernel launch, and it seems that now it is the synchronization that takes too long instead of the cudaFree(). Note that I am using only one stream and a single GPU, so I am not running anything else in parallel.
But it was said that when using one stream the instructions are serialized, so somecode() has to finish before execution passes to the next line, doesn't it?
You seem to be launching 13,421,721,600 threads (262143 × 2 × 50 = 26,214,300 blocks × 512 threads each)… is that really the case?
In that case, the time is being consumed by the kernel itself, as the previous posters said.
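One way to confirm that is to time the kernel directly with CUDA events; a minimal sketch, using an empty stand-in kernel (Dummy, my own name) and the launch configuration quoted above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Dummy() { }  // empty stand-in for the real kernel

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    Dummy<<<dim3(262143, 2, 50), dim3(1, 512, 1)>>>();
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);  // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}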
The host thread will NOT block - it asynchronously sends instructions to the GPU.
The GPU will definitely block between kernel launches - but imagine the GPU worker as a second thread running in parallel: each host instruction launches a job onto that GPU worker thread… something like that.
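A minimal sketch of that behaviour, with a made-up busy-wait kernel (Busy): the launch call itself returns almost immediately, and the first blocking call afterwards (cudaDeviceSynchronize, cudaMemcpy, or cudaFree) absorbs the kernel's execution time, which is why the 46 seconds moved from cudaFree to the synchronize.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel that spins for roughly 10^8 clock cycles (~0.1 s at 1 GHz)
__global__ void Busy()
{
    long long start = clock64();
    while (clock64() - start < 100000000LL) { }
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    Busy<<<1, 1>>>();         // asynchronous: returns almost immediately
    auto t1 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();  // blocks here until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("launch call: %.3f ms, synchronize: %.3f ms\n",
           ms(t1 - t0).count(), ms(t2 - t1).count());
    return 0;
}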