My problem is that my kernel has to finish completely before the host program continues.
So I am looking for something like __syncthreads(), but for ALL the threads of the kernel, not just those in one block.
Pseudocode:
...Do something...
g_CUDARTKernel<<< blocks, threads>>>();
"SYNC ALL THE KERNEL THREADS"
...Do something (here everything of the kernel has to be calculated) ...
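cudaThreadSynchronize() blocks the host thread until all previously issued device operations have finished. (In newer CUDA releases it is deprecated in favor of cudaDeviceSynchronize(), which does the same thing.) cudaStreamSynchronize() makes the host thread wait only for the device operations in the stream you specify, which is helpful if you are using more than one stream.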
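To make that concrete, here is a minimal sketch of the launch-then-wait pattern. The kernel scaleKernel and the helper runAndWait are made-up stand-ins for your g_CUDARTKernel and the surrounding host code, and I am assuming a toolkit recent enough to provide cudaDeviceSynchronize(); on an old one, cudaThreadSynchronize() would go in the same place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for g_CUDARTKernel: doubles each element.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: copies data to the device, launches the kernel,
// waits for it to finish, and copies the result back.
// Returns 0 on success, 1 on any CUDA error.
int runAndWait(float *hostData, int n)
{
    float *devData = nullptr;
    size_t bytes = n * sizeof(float);

    if (cudaMalloc(&devData, bytes) != cudaSuccess) {
        return 1;
    }
    cudaError_t err = cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;

        // The launch is asynchronous: control returns to the host immediately.
        scaleKernel<<<blocks, threads>>>(devData, n);

        // Catches launch errors such as an invalid configuration.
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // "SYNC ALL THE KERNEL THREADS": block the host until the kernel is done.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        // Past this point every thread of the kernel has completed.
        err = cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(devData);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    float values[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (runAndWait(values, 4) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("%f %f %f %f\n", values[0], values[1], values[2], values[3]);
    return 0;
}
```

If the kernel had been launched into a specific stream, cudaStreamSynchronize(stream) could replace the cudaDeviceSynchronize() call so the host waits only for that stream.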