There are memory fences and block-level synchronization for CUDA kernels. Is there a way to implement a device-wide synchronization inside a CUDA kernel, like cudaDeviceSynchronize() but callable from the kernel? The motivation is to run an iterative computation inside the kernel, reduce the kernel launch overhead, and avoid the data transfer for each iteration.
There is no device-wide sync mechanism provided by CUDA that can be called from within a kernel.
You can run an iterative computation by calling a kernel repeatedly. The kernel launch boundary itself acts as a device-wide sync: all work from one launch completes before the next launch begins. This method does not require any data transfer between host and device per iteration, and it may be possible to hide the launch overhead if you can issue the launches back-to-back in the same stream.
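A minimal sketch of that pattern, assuming a simple per-element update (the kernel name, update formula, and iteration count here are illustrative, not from the question): the host loop relaunches the kernel, keeping all data resident on the device, and only synchronizes with the host once at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One iteration of some hypothetical update; the formula is a placeholder.
__global__ void step_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = 0.5f * data[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int n_iter = 100;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int it = 0; it < n_iter; ++it) {
        // Launches in the same stream execute in order, so each launch
        // acts as a device-wide sync point for the previous iteration.
        // No host<->device copies inside the loop; data stays on the GPU.
        step_kernel<<<grid, block>>>(d_data, n);
    }
    cudaDeviceSynchronize();  // sync with the host only once, at the end

    float h0;
    cudaMemcpy(&h0, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] after %d iterations: %f\n", n_iter, h0);
    cudaFree(d_data);
    return 0;
}
```

Because the launches are asynchronous with respect to the host, the CPU can enqueue many iterations ahead, which is what allows the per-launch overhead to overlap with GPU execution.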
If there is a noticeable app-level performance impact from launch overhead (order of magnitude: 5 usec per launch), it might be time to re-think the software architecture. I realize that small, short-duration kernels cannot always be avoided (and kernels that weren't short-running in the past might have become so due to advances in hardware performance), but it is usually worth considering alternative approaches.