If a kernel is executed asynchronously, how can I estimate its running time?
For example:
Timer();
Kernel1<<<grid, block>>>();
Timer();
Does this give the real execution time of the kernel? Or do we have to call cudaThreadSynchronize() after the kernel launch, like this:
Timer();
Kernel1<<<grid, block>>>();
cudaThreadSynchronize();
Timer();
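
For reference, here is a minimal sketch of the two patterns written out as a complete program. It is an illustration under stated assumptions, not a definitive answer. The kernel body, the buffer size, and the helper name timeKernel1 are hypothetical placeholders that are not in the snippets above. Kernel launches return to the host before the kernel finishes, so the sketch waits for completion before reading the timer. It uses CUDA events (cudaEventRecord / cudaEventElapsedTime) in place of the generic Timer() calls, and it notes that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for Kernel1 in the question.
__global__ void Kernel1(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: returns the elapsed GPU time of one Kernel1 launch
// in milliseconds, or -1.0f if the launch or execution failed.
float timeKernel1(float *d_data, int n, dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);             // enqueue "start" in the default stream
    Kernel1<<<grid, block>>>(d_data, n);   // asynchronous: returns immediately
    cudaEventRecord(stop, 0);              // enqueue "stop" after the kernel

    // Block the host until "stop" has been reached, i.e. the kernel is done.
    // With a host-side timer, cudaDeviceSynchronize() would play this role
    // (cudaThreadSynchronize() is its deprecated predecessor).
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess)
        err = cudaGetLastError();          // catches launch configuration errors

    float ms = -1.0f;
    if (err == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    float *d_data = NULL;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    float ms = timeKernel1(d_data, n, grid, block);
    printf("Kernel1 took %f ms\n", ms);

    cudaFree(d_data);
    return ms < 0.0f ? 1 : 0;
}
```

Without the synchronization step, the host timer in the first pattern would mostly measure the cost of issuing the launch rather than the kernel's run time. The first launch in a process also tends to include one-time initialization cost, so timing a second launch, or averaging several, is usually more representative.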