I think blocks get launched in order, but they don’t necessarily execute or complete in order. There’s no way to synchronize data between blocks except through atomic operations. Generally, you have to let all threads of all blocks finish, write the results to global memory, and then launch a new kernel to use the new data (unless you can do it via atomics).
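A minimal sketch of that “split it into two kernels” pattern (my own illustration, not code from this thread — the kernel names and the doubling/adding work are placeholders). Kernel launches issued to the same stream run in order, so the second kernel only starts after every block of the first has written its results to global memory:

```cuda
__global__ void pass1(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;   // placeholder work, written to global memory
}

__global__ void pass2(const float *data, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] + 1.0f;    // safely sees every block's pass1 result
}

// Host side: launches on the default stream are implicitly ordered,
// so no explicit synchronization is needed between the two kernels.
//   pass1<<<blocks, threads>>>(d_data, n);
//   pass2<<<blocks, threads>>>(d_data, d_out, n);
```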
In parallel, up to the number of blocks that can run at once (which depends on the GPU model and on how you code your solution); after that, the remaining blocks have to wait until one of the running blocks finishes.
So if your application has 10000 blocks and your GPU can run, say, 36 of them at once, then 36 will be launched and 9964 will wait. As some of the first 36 finish, a similar number of the waiting blocks will start up to replace them. The original 36 will probably not all finish at the same time or in order. So: “make no assumptions about the order of block execution.”
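You can see this for yourself with a small experiment (a sketch I wrote for illustration, not code from this thread): each block grabs a “finish ticket” via `atomicAdd` as its last action. Dumping the tickets afterwards typically shows that block 0 did not get ticket 0, and the order changes from run to run:

```cuda
#include <cstdio>

__global__ void finishOrder(int *tickets, int *counter)
{
    // ... real work would go here ...
    if (threadIdx.x == 0) {
        // One thread per block records when its block reached the end.
        tickets[blockIdx.x] = atomicAdd(counter, 1);
    }
}

int main()
{
    const int numBlocks = 64;
    int *tickets, *counter;
    cudaMalloc(&tickets, numBlocks * sizeof(int));
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    finishOrder<<<numBlocks, 128>>>(tickets, counter);
    cudaDeviceSynchronize();

    int host[numBlocks];
    cudaMemcpy(host, tickets, sizeof(host), cudaMemcpyDeviceToHost);
    for (int i = 0; i < numBlocks; ++i)
        printf("block %2d got finish ticket %2d\n", i, host[i]);

    cudaFree(tickets);
    cudaFree(counter);
    return 0;
}
```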