cudaDeviceSynchronize in this case prevents any subsequent code in that thread from running until the kernel launched by that thread (if it launched one) completes. No other semantics are implied. It is not a device-wide sync.
But is there any sync instruction that does a device-wide sync? Or is there any way to guarantee that “**More Code” runs only after the child kernels launched by all threads of all blocks have finished?
If you want to do a device-wide sync, the proper method(s) in CUDA are:
1. Use the kernel launch itself as a device-wide sync, i.e. break up the code into 2 or more kernels.
2. Use Cooperative Groups. However, the device-wide sync in cooperative groups is only available on devices of compute capability 6.0 or higher, and cooperative groups requires CUDA 9 or higher.
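To make the two methods above concrete, here is a minimal sketch, not a definitive implementation. The kernel names (phase1, phase2, fused), the sizes, and the "reverse an array" workload are all made up for illustration; the only point is that the second step reads values written by other blocks, so it needs a device-wide sync before it. It assumes a single cc 6.0+ device and CUDA 9 or newer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Method 1: two kernels. The kernel boundary acts as the device-wide sync.
__global__ void phase1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void phase2(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Reads an element that was written by a different thread/block.
    if (i < n) out[i] = data[n - 1 - i];
}

// Method 2: one cooperative kernel with an explicit grid-wide barrier.
__global__ void fused(int *data, int *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
    grid.sync();  // every thread of every block must reach this call
    if (i < n) out[i] = data[n - 1 - i];
}

// Copies the result back and verifies out[i] == n - 1 - i.
static bool verify(const int *d_out, int n) {
    std::vector<int> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h[i] != n - 1 - i) return false;
    return true;
}

int main() {
    const int n = 512, block = 128, grid = n / block;  // illustrative sizes
    int *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));

    // Method 1: launches in the same stream execute in order.
    phase1<<<grid, block>>>(d_data, n);
    phase2<<<grid, block>>>(d_data, d_out, n);
    cudaDeviceSynchronize();
    bool ok1 = verify(d_out, n);
    printf("two kernels: %s\n", ok1 ? "ok" : "FAILED");

    // Method 2: only valid if the device supports cooperative launch
    // and the whole grid can be resident on the device at once.
    int dev = 0, coop = 0, numSms = 0, blocksPerSm = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, fused, block, 0);

    bool ok2 = true;
    if (!coop || grid > numSms * blocksPerSm) {
        printf("cooperative launch not possible on this device, skipping\n");
    } else {
        cudaMemset(d_data, 0, n * sizeof(int));
        cudaMemset(d_out, 0, n * sizeof(int));
        int n_arg = n;
        void *args[] = { &d_data, &d_out, &n_arg };
        cudaError_t err = cudaLaunchCooperativeKernel(
            (void *)fused, dim3(grid), dim3(block), args, 0, 0);
        cudaDeviceSynchronize();
        ok2 = (err == cudaSuccess) && verify(d_out, n);
        printf("cooperative kernel: %s\n", ok2 ? "ok" : "FAILED");
    }

    cudaFree(d_data);
    cudaFree(d_out);
    return (ok1 && ok2) ? 0 : 1;
}
```

Something like nvcc -arch=sm_60 example.cu should build it; as far as I recall, older toolkits (before CUDA 11) also needed -rdc=true for the grid-wide sync. Note that the cooperative kernel must be launched with cudaLaunchCooperativeKernel rather than <<<...>>>, and the grid must be small enough to be fully resident on the device, which is why the sketch checks the occupancy first.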
Yes, I was evaluating alternatives to that solution, to avoid splitting the work into several kernels launched from the host (even with different streams). That is, alternatives using Dynamic Parallelism.