CUDA block synchronization

I have b blocks, and each block has t threads. I can use __syncthreads() to synchronize the threads within a particular block. For example:

__global__ void aFunction() {
    // execute something
    __syncthreads();   // barrier for the threads of this block only
}

But my problem is to synchronize all the threads in all the blocks. How can I do this?

You could break the kernel into two parts and use cudaThreadSynchronize() from the host side.
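
A rough sketch of that suggestion follows. The kernel boundary itself acts as the grid-wide barrier, because kernels launched into the same stream run in order. The names phaseOne, phaseTwo and runTwoPhases and the data layout are made up for illustration, and cudaDeviceSynchronize() is used because cudaThreadSynchronize() is deprecated in later CUDA toolkits.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase one: every thread writes its own global index.
__global__ void phaseOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Phase two: every thread reads an element that was usually written by another block.
__global__ void phaseTwo(const int* data, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[n - 1 - i];
}

// Hypothetical host helper: returns true if both phases ran and the result is as expected.
bool runTwoPhases(int blocks, int threads)
{
    int n = blocks * threads;
    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(data); return false; }

    phaseOne<<<blocks, threads>>>(data, n);
    // Same stream: phaseTwo starts only after every block of phaseOne has exited.
    phaseTwo<<<blocks, threads>>>(data, out, n);

    // Host-side wait, needed before the host inspects errors or results.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> host(n);
    ok = ok && cudaMemcpy(host.data(), out, n * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == n - 1 - i);

    cudaFree(data);
    cudaFree(out);
    return ok;
}
```

Calling runTwoPhases(300, 128) would push 300 blocks of 128 threads through both phases. No host-side synchronization is needed between the two launches; the final wait is only there so the host can read the results.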

In that case, should I call the new kernel function 10 times from the host? Is that efficient?

Actually I don’t know :D
cudaThreadSynchronize() is the only API you have to synchronize all threads. Otherwise, you can try to implement something using global memory…

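__device__ volatile bool sync[BLOCKS];   // one arrival flag per block, zero-initialized

__global__ void kernel1()
{
    // ... phase one ...
    __syncthreads();                     // the whole block has finished phase one
    if (threadIdx.x == 0) {
        __threadfence();                 // publish this block's global writes
        sync[blockIdx.x] = true;         // mark this block as arrived
        bool s = false;
        while (!s) {                     // spin until every block has arrived
            s = true;                    // reset, otherwise s could never become true again
            for (int i = 0; i < BLOCKS; i++) s = sync[i] && s;
        }
    }
    __syncthreads();                     // release the other threads of the block
    // now all threads should be synchronized (single use: the flags are never reset)
}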

This isn’t tested and probably there are also better solutions…
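
One candidate for a better solution, not mentioned in this thread, is the cooperative groups API that later CUDA toolkits provide: a kernel started with cudaLaunchCooperativeKernel() may call grid.sync() as a grid-wide barrier. The sketch below assumes CUDA 9 or later and a device that reports cudaDevAttrCooperativeLaunch. The names twoPhaseKernel and runCooperative are hypothetical, and the grid is sized from an occupancy query because every block has to be resident at once.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Both phases in one kernel, separated by a grid-wide barrier.
__global__ void twoPhaseKernel(int* data, int* out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;
    int first  = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = first; i < n; i += stride) data[i] = i;               // phase one
    grid.sync();                                                        // all blocks wait here
    for (int i = first; i < n; i += stride) out[i] = data[n - 1 - i];   // phase two
}

// Hypothetical host helper.
// Returns 1 if the result is correct, 0 if it is wrong,
// and -1 if cooperative launch is unsupported or a runtime call failed.
int runCooperative(int n, int threads)
{
    int supported = 0;
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0) != cudaSuccess
        || !supported) return -1;

    // The whole grid must fit on the device at once, so size it from the occupancy query.
    cudaDeviceProp prop;
    int perSM = 0;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &perSM, twoPhaseKernel, threads, 0) != cudaSuccess) return -1;
    int blocks = perSM * prop.multiProcessorCount;
    if (blocks <= 0) return -1;

    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(data); return -1; }

    void* args[] = { &data, &out, &n };
    bool ran = cudaLaunchCooperativeKernel((void*)twoPhaseKernel, dim3(blocks),
                                           dim3(threads), args, 0, 0) == cudaSuccess
            && cudaDeviceSynchronize() == cudaSuccess;

    std::vector<int> host(n);
    ran = ran && cudaMemcpy(host.data(), out, n * sizeof(int),
                            cudaMemcpyDeviceToHost) == cudaSuccess;
    int result = ran ? 1 : -1;
    for (int i = 0; ran && i < n; ++i)
        if (host[i] != n - 1 - i) { result = 0; break; }

    cudaFree(data);
    cudaFree(out);
    return result;
}
```

Each thread loops over several elements with a grid-sized stride, so the amount of work no longer dictates the number of blocks. That is what lets the grid stay small enough to be fully resident.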

Launching a kernel is very fast.

And there is a very good reason not to synchronise between blocks that are running. Suppose I have a task that needs 300 blocks and my GPU can only execute 90 at a time. 90 are started, and more are started only when some of those have completed and exited; the only way new blocks can start is for some already-running ones to exit. A barrier that waited for all 300 blocks would never be satisfied, because the 90 resident blocks would wait forever for blocks that cannot start. So the only ‘all-of-grid’ synchronisation that can happen is when all blocks have exited.

OK, in theory it would work if you are running 90 blocks or fewer, but my laptop has a smaller GPU that might only execute 8 blocks at a time.
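
The number of blocks a given GPU can keep resident at once can be estimated at run time rather than guessed. Here is a minimal sketch, assuming a CUDA toolkit recent enough to provide the occupancy API. The kernel someKernel and the helper maxResidentBlocks are placeholders, and the answer depends on the real kernel's register and shared-memory use as well as on the device.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel you actually intend to launch.
__global__ void someKernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Hypothetical helper: estimated number of blocks of someKernel that can be
// resident on device 0 at the same time for this block size, or -1 on error.
int maxResidentBlocks(int threadsPerBlock)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, someKernel, threadsPerBlock, 0) != cudaSuccess) return -1;

    // Resident blocks per multiprocessor times the number of multiprocessors.
    return blocksPerSM * prop.multiProcessorCount;
}
```

A grid larger than the value returned by maxResidentBlocks(128) runs in waves, which is exactly the situation in which a hand-written barrier across blocks can hang.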

Suppose they made some way of saving a block's context. The problem is that there are some applications where you want to run 100,000 blocks, or millions of blocks. To save context you need to save the registers of all threads in the block, plus shared arrays/variables, plus ‘local’ variables. That could total 15 KB per block, and 15 KB x 1 million blocks is 15 GB!