I am trying to call cudaDeviceSynchronize() inside a kernel that uses dynamic parallelism, so that the parent kernel waits for its child kernels to finish before continuing, like this:
__global__ void father(…){
. . .
. . .
__syncthreads();
if(condition met) //DIVERGENCE
{
size_t shrbytesants = VALUE;
child<<<1,n_pnts_ants,VALUE>>>(...);
cudaDeviceSynchronize();
cudaError err = cudaGetLastError();
printf("error : %s \n",cudaGetErrorString( err ));
}
__syncthreads();
}
Without the cudaDeviceSynchronize() call the code works, but the parents do not wait for their children to finish.
If I add cudaDeviceSynchronize(), the code compiles fine and executes without apparent errors, but cuda-memcheck reports: "Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaLaunch."
I can't tell whether I'm doing something wrong with stack or heap memory, but the fact that the code works without cudaDeviceSynchronize() suggests to me that the memory setup is fine.
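To narrow this down from the host side, something like the sketch below could be used to print the runtime limits that seem relevant (stack, device heap, and the two device-runtime limits) and to check the launch status separately from the execution status. This is only a sketch under assumptions: it assumes a toolkit version that still supports device-side cudaDeviceSynchronize(), the helper name printLimit is made up for illustration, and the real father launch is left as a comment because its arguments are not shown here. Whether any of these limits is actually behind the error above is a guess, not something confirmed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print one runtime limit, or the error if the query fails.
static void printLimit(const char *name, cudaLimit limit)
{
    size_t value = 0;
    cudaError_t err = cudaDeviceGetLimit(&value, limit);
    if (err == cudaSuccess)
        printf("%-40s %zu\n", name, value);
    else
        printf("%-40s query failed: %s\n", name, cudaGetErrorString(err));
}

int main()
{
    // Limits that may matter when a parent grid synchronizes on its children.
    printLimit("cudaLimitStackSize", cudaLimitStackSize);
    printLimit("cudaLimitMallocHeapSize", cudaLimitMallocHeapSize);
    printLimit("cudaLimitDevRuntimeSyncDepth", cudaLimitDevRuntimeSyncDepth);
    printLimit("cudaLimitDevRuntimePendingLaunchCount",
               cudaLimitDevRuntimePendingLaunchCount);

    // The real launch of father<<<...>>>(...) would go here (arguments not shown).

    // Error reported by the launch itself (e.g. resources could not be reserved).
    cudaError_t launchErr = cudaGetLastError();
    printf("launch status : %s\n", cudaGetErrorString(launchErr));

    // Error reported while the kernel (and its children) were running.
    cudaError_t syncErr = cudaDeviceSynchronize();
    printf("sync status   : %s\n", cudaGetErrorString(syncErr));

    return (launchErr == cudaSuccess && syncErr == cudaSuccess) ? 0 : 1;
}
```

As written, this host-only sketch should build with plain nvcc; the actual program with the father and child kernels additionally needs relocatable device code (-rdc=true) and linking against cudadevrt. If one of the printed limits looks too small, cudaDeviceSetLimit() with the same enum values could be tried before the launch.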
Thanks