cudaDeviceSynchronize() returns cudaErrorMemoryAllocation

I am trying to call cudaDeviceSynchronize() inside a kernel that uses dynamic parallelism, so that the parent kernel waits for its child kernels to finish before continuing, like this:

__global__ void father(…){
. . .
. . .
__syncthreads();

if(condition met) // DIVERGENCE: only some threads take this branch
{
  size_t shrbytesants = VALUE;  // dynamic shared memory for the child, in bytes

  child<<<1,n_pnts_ants,VALUE>>>(...);

  cudaDeviceSynchronize();  // wait for the child grid to finish
  cudaError err = cudaGetLastError();
  printf("error : %s \n",cudaGetErrorString( err ));
}

__syncthreads();
}

Without the cudaDeviceSynchronize() call the code works, but the parent kernels do not wait for the children to finish.
If I add cudaDeviceSynchronize(), the code compiles fine and executes without apparent errors, but cuda-memcheck reports: "Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaLaunch."
I can't tell whether I'm doing something wrong with stack or heap memory, but the fact that the code works without cudaDeviceSynchronize() suggests to me that memory is OK.
Thanks

I suggest you read the entire CUDA Dynamic Parallelism (CDP) section in the programming guide. There are a number of considerations here, especially synchronization depth. Calling cudaDeviceSynchronize() in a kernel increases the amount of parent state that has to be kept while the parent waits for its children, and that in turn increases the memory required for the kernel launch.
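As a rough illustration of the limits involved, here is a minimal host-side sketch that sets the two device-runtime limits tied to in-kernel synchronization and child launches before the parent kernel is started. The helper name configureDeviceRuntime and the values passed to it are hypothetical, chosen only for the example. Whether either limit is the actual cause here cannot be determined from the snippet, and toolkits that no longer support device-side cudaDeviceSynchronize() may reject or ignore the sync-depth setting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): adjust the device-runtime
// limits that govern in-kernel synchronization and pending child launches.
// It must be called on the host before the parent kernel is launched.
static cudaError_t configureDeviceRuntime(size_t syncDepth, size_t pendingLaunches)
{
    // Deepest nesting level at which a kernel may call cudaDeviceSynchronize().
    // Memory for saved parent state is reserved according to this value.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, syncDepth);
    if (err != cudaSuccess) return err;

    // Number of child launches that may be outstanding at the same time.
    return cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, pendingLaunches);
}

int main()
{
    // Example values only (assumptions); tune them for the real launch pattern.
    cudaError_t err = configureDeviceRuntime(2, 4096);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limits back to see what the runtime actually accepted.
    size_t depth = 0, pending = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth: %zu, pending launch count: %zu\n", depth, pending);

    // ... launch the parent kernel here ...
    return 0;
}
```

Each increase reserves more device memory up front, so raising the limits far beyond what the launch pattern needs can itself lead to allocation failures.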

I doubt there is anything useful that can be said about your incomplete snippet.