If kernel launches are asynchronous, one could start two kernels that run in parallel. Is there a way to wait for a particular kernel to finish executing, or does one have to resort to cudaThreadSynchronize() and wait for all kernels to complete?
Kernel launches are asynchronous with respect to the host, that's right. However, you cannot start kernels that run in parallel with each other (as discussed here: http://forums.nvidia.com/index.php?showtopic=28823).
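For waiting on one particular kernel rather than on everything, one option that might fit is a CUDA event recorded right after the launch. The sketch below is only an illustration under assumptions: the kernel `scale`, the buffer size, and the event name `done` are made up for this example and do not come from the thread, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to wait for.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    // The launch returns control to the host immediately.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Record an event in the same stream (the default stream, 0).
    // The event completes once the work queued before it has finished.
    cudaEventRecord(done, 0);

    // ... independent host-side work could go here ...

    // Block the host until the event, and therefore this kernel, is done.
    cudaError_t err = cudaEventSynchronize(done);
    printf("kernel finished: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(done);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Compared with cudaThreadSynchronize(), which waits for all outstanding device work, cudaEventSynchronize() waits only for the work queued before the recorded event. Since kernels do not overlap each other here, the difference mostly matters when further work has already been queued behind the kernel of interest.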
And also: