Question regarding cudaThreadSynchronize(): does it act like a barrier?

I read a few posts about cudaThreadSynchronize(), and from what I can make out, it waits until all threads have finished executing the kernel. For example:

kernel1<<<grid, block>>>(…);
cudaThreadSynchronize();
// Here all threads are finished and the device is ready.

But isn’t there an implicit barrier after every kernel invocation? Isn’t calling cudaThreadSynchronize() after a kernel launch redundant?

Having said that, if I leave out cudaThreadSynchronize(), I do not get correct timings. Why is that? Can someone explain this behaviour to me?

Thanks in advance.

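Kernel launches are asynchronous: the call returns control to the host right away, so the host can do other work while the GPU is still running the kernel. That is why your timings are off without cudaThreadSynchronize(): the host timer stops as soon as the launch returns, not when the kernel finishes.
The implicit barrier comes later, when you do a blocking cudaMemcpy to bring the results back to the host: the copy waits for the previously launched kernel to complete before it starts.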
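
To see the effect on timing, here is a minimal sketch, not taken from the original post. The kernel busyKernel, the problem size, and the iteration count are all made up for illustration, and it uses cudaDeviceSynchronize(), which replaced the deprecated cudaThreadSynchronize() in newer toolkits. It times the same launch twice with a host clock: once stopping right after the launch returns, and once after synchronizing.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: loops long enough that the kernel run time
// is clearly larger than the launch overhead.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

int main()
{
    using Clock = std::chrono::steady_clock;
    using Millis = std::chrono::duration<double, std::milli>;

    const int n = 1 << 20;  // assumed problem size
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch so one-time initialization is not included in the timing.
    busyKernel<<<grid, block>>>(d_data, n, 1);
    cudaDeviceSynchronize();

    Clock::time_point t0 = Clock::now();
    busyKernel<<<grid, block>>>(d_data, n, 10000);
    // The launch has returned, but the kernel is most likely still running.
    Clock::time_point t1 = Clock::now();
    // Block the host until all previously issued device work has finished.
    cudaDeviceSynchronize();
    Clock::time_point t2 = Clock::now();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    std::printf("without sync: %.3f ms (launch overhead only)\n",
                Millis(t1 - t0).count());
    std::printf("with sync:    %.3f ms (launch + kernel execution)\n",
                Millis(t2 - t0).count());

    cudaFree(d_data);
    return 0;
}
```

On a typical setup the first number should be much smaller than the second, since it only covers the time needed to queue the launch. The exact values depend on the GPU and driver, so treat them as illustrative only.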