Actually I don’t know :D
cudaThreadSynchronize (called from the host, after the kernel launch) is the only API you have to synchronize all threads. Otherwise, you can try to implement something using global memory…
__device__ volatile bool sync[BLOCKS]; // volatile, so the spin loop re-reads memory

__global__ void kernel1() {
…
__syncthreads();                     // sync within the block first
if (threadIdx.x == 0) {
    sync[blockIdx.x] = true;
    __threadfence();                 // make our flag visible to other blocks
    bool s;
    do {                             // spin until every block has set its flag
        s = true;
        for (int i = 0; i < BLOCKS; i++) s = sync[i] && s;
    } while (!s);
}
__syncthreads();                     // release the rest of the block
// now all blocks should be in sync (only if all BLOCKS are resident at once!)
}
This isn’t tested, and there are probably better solutions…
And there is a very good reason not to synchronise between blocks that are running. Suppose I have a task that needs 300 blocks but my GPU can only execute 90 at a time: 90 are started, and more are launched only as some of those complete and exit. Since the only way new blocks can start is for already-running ones to exit, the only ‘all-of-grid’ synchronisation that can happen is when all blocks have exited.
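That is exactly why the usual workaround is to split the work at the sync point into two kernels: the end of the first kernel is itself a grid-wide barrier, because every block must have exited before the next launch runs. A minimal sketch, assuming hypothetical phase1/phase2 kernels and data sizes:

```cuda
// Hypothetical kernels: the grid-wide barrier is the gap between the
// two launches, not anything inside a kernel.
__global__ void phase1(float *data) { /* work before the barrier */ }
__global__ void phase2(float *data) { /* work after the barrier */ }

int main() {
    float *d_data;
    cudaMalloc(&d_data, 300 * 256 * sizeof(float));

    phase1<<<300, 256>>>(d_data);
    // Kernels in the same (default) stream already run in order, so
    // phase2 will see all of phase1's global-memory writes; the explicit
    // sync below just makes the host wait as well.
    cudaThreadSynchronize();
    phase2<<<300, 256>>>(d_data);

    cudaFree(d_data);
    return 0;
}
```

This works for any number of blocks, 300 or 100,000, precisely because it never asks blocks that aren’t resident to wait on each other.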
OK, in theory it would work if you are running 90 blocks or fewer, but my laptop has a smaller GPU that might only execute 8 blocks at a time.
Suppose they added some way of saving a block’s context. The problem is that some applications want to run 100,000 blocks, or even millions. To save a context you need to save the registers of all threads in the block, plus shared arrays/variables, plus preserve ‘local’ variables; that could total 15 KB per block, and 15 KB × 1 million blocks is 15 GB!!