I have read a few posts about cudaThreadSynchronize(), and my understanding is that it blocks the host until all threads of the kernel have finished executing. For example:
kernel1<<<grid, block>>>(…);
cudaThreadSynchronize();
// At this point all threads have finished and the device is idle.
But isn’t there an implicit barrier after every kernel invocation? If so, isn’t calling cudaThreadSynchronize() after a kernel call redundant?
That said, if I leave out cudaThreadSynchronize(), I do not get correct timings. Why is that? Can someone explain this behaviour to me?
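To make the timing comparison concrete, here is a minimal sketch of the kind of measurement I mean. It is not my actual code: the kernel busyKernel, the helper timeLaunchMs, the problem size, and the iteration count are all hypothetical, and it uses cudaDeviceSynchronize(), the non-deprecated replacement for cudaThreadSynchronize(). It times the same launch with a host clock, once without and once with a synchronization before the clock is stopped. Since kernel launches are asynchronous with respect to the host, I would expect the unsynchronized number to reflect mostly launch overhead rather than kernel run time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: enough arithmetic per thread to take measurable time.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) {
            x = x * 1.0001f + 0.5f;
        }
        data[i] = x;
    }
}

// Times one launch with a host clock. If syncBeforeStop is true, the host
// waits for the device to finish before the clock is stopped.
double timeLaunchMs(float *d_data, int n, bool syncBeforeStop) {
    const int block = 256;
    const int grid = (n + block - 1) / block;
    auto start = std::chrono::steady_clock::now();
    busyKernel<<<grid, block>>>(d_data, n, 10000);
    if (syncBeforeStop) {
        cudaDeviceSynchronize();
    }
    auto stop = std::chrono::steady_clock::now();
    // Drain the device so the next measurement starts from an idle GPU.
    cudaDeviceSynchronize();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    timeLaunchMs(d_data, n, true);  // warm-up launch, result discarded
    double noSync = timeLaunchMs(d_data, n, false);
    double withSync = timeLaunchMs(d_data, n, true);
    std::printf("without sync: %.3f ms\n", noSync);
    std::printf("with sync:    %.3f ms\n", withSync);

    cudaFree(d_data);
    return 0;
}
```

If host-side timers are not required, CUDA events (cudaEventRecord and cudaEventElapsedTime) are another way to time a kernel on the device itself.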
Thanks in advance.