The way to sync blocks is to finish the kernel and launch a new one.
This is usually not as expensive as it sounds.
There is no general way to sync blocks other than this. The whole block/grid abstraction means that you likely have (many!) more blocks than you have SMs, so some blocks will be waiting in a queue for other blocks to finish completely before they can take their place. The idea of waiting for all blocks to reach the same point in the code rests on the assumption that all blocks are actually executing concurrently, which is not true in general.
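To make the kernel-relaunch idea concrete, here is a minimal sketch (the kernel names and the two-phase computation are illustrative, not from any particular codebase): launches on the same stream execute in order, so the boundary between two kernels acts as an implicit grid-wide barrier.

```cuda
#include <cuda_runtime.h>

// Hypothetical two-phase computation: every element must finish
// phase 1 (across ALL blocks) before any element starts phase 2.
__global__ void phase1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;   // phase-1 work
}

__global__ void phase2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;   // phase-2 work
}

void run(float *d_data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Kernels on the same (default) stream run in launch order, so the
    // end of phase1 is effectively a grid-wide sync point for phase2.
    phase1<<<blocks, threads>>>(d_data, n);
    phase2<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();               // wait for completion on the host
}
```

The second launch costs only a kernel-launch overhead (typically microseconds), which is why this is usually not as expensive as it sounds.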
This may sound a little funny, but developing a parallel algorithm is a ‘state of mind’. I’ve seen so many students struggle with, for example, OOP after being trained in procedural methods. At some point, however, something clicks in the brain and they ‘think’ OOP. Some take an hour to reach that point, some take weeks. The same is true in my CUDA classes.
Read as much on parallel programming as possible and when it all clicks, you’ll be able to solve your problem as well.
You’ll see, for example, that nobody really ‘syncs all threads over all blocks’; instead, people devise methods built from several steps of many independent calculations. See for example the prefix sum algorithm.
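As a sketch of that style (the names and sizes here are illustrative assumptions): a sum reduction needs no cross-block sync at all. Pass one reduces each block independently to a partial result using only the in-block barrier `__syncthreads()`; a second kernel launch then reduces the partials, with the kernel boundary standing in for the missing grid-wide sync.

```cuda
#include <cuda_runtime.h>

// Pass 1: each block reduces its slice of `in` to one partial sum.
// Only __syncthreads() (a WITHIN-block barrier) is needed.
__global__ void block_sums(const float *in, float *partial, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];  // one value per block
}

// Pass 2: launch block_sums again on `partial` (with one block),
// or copy the partials back and finish on the host. Either way, the
// kernel boundary between the passes replaces the grid-wide sync.
```

The same decomposition, with an extra down-sweep pass, is the skeleton of the multi-block prefix sum mentioned above.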