CUDA provides no efficient, reliable inter-block synchronization mechanism except kernel launches themselves. If you split your calculation into two kernels and launch them in the same stream, CUDA guarantees that all blocks of the first kernel have finished, and their writes to device memory are visible, before the second kernel starts. Kernel launch overhead is low enough that you shouldn't worry about it unless your kernels run for less than about 5 to 10 microseconds.
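As a sketch of this pattern, the hypothetical two-phase reduction below uses the kernel boundary as its inter-block synchronization point: the first kernel writes one partial sum per block, and the second kernel, launched in the same stream, can safely read every entry because all of the first kernel's blocks are guaranteed to have completed. The kernel and variable names are illustrative, not from the original text.

```cuda
#include <cuda_runtime.h>

// Phase 1: each block reduces its slice of `in` into one partial sum.
// No block can see another block's result within this kernel.
__global__ void partialSums(const float* in, float* partial, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Standard shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

// Phase 2: reads ALL partial sums. This is only safe because the
// kernel boundary guarantees phase 1 finished and its device-memory
// writes are visible before this kernel begins.
__global__ void finalSum(const float* partial, float* out, int numBlocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (int b = 0; b < numBlocks; ++b) total += partial[b];
        *out = total;
    }
}

void reduce(const float* d_in, float* d_partial, float* d_out, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Same stream (the default stream here), so the launches execute
    // in order; no explicit synchronization call is needed between them.
    partialSums<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
    finalSum<<<1, 1>>>(d_partial, d_out, blocks);
}
```

Note that the ordering guarantee comes from the stream, not from any barrier in the code: launching the two kernels into different streams would forfeit it.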