synchronisation between blocks

Hi -

I just started to program in CUDA and was wondering if there was a way to synchronize between thread blocks. From my understanding and stepping through emulation mode, __syncthreads() only synchronizes threads within a block.

I am basically trying to implement a function that iterates over a matrix. Each block handles a submatrix. The catch is that each block has to finish the current iteration before any of the other blocks can go on to the next one. Something like a join():
Code:

for( int i = 0; i < N; i++ )
{
    // 1. each block does its computation
    // 2. wait for every block to finish, i.e. join()
    // (tried a __syncthreads() here, but that only syncs threads inside the block)
}

I’m pretty sure the computation part is correct, since if I use just a single block everything comes out as it should. When I use multiple blocks, things break.

One way I thought of doing it was to put the iteration loop on the host and have the kernel take the iteration number as an argument. Something like:

Code:

for( int i = 0; i < N; i++ )
{
    myFunc<<< threads, dim >>>( i );
}

But wouldn’t this be slower, with all the kernel-launch and host-side overhead?
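For what it’s worth, this host-side loop does give you the join you want: kernel launches issued to the same stream execute in order, so every block of iteration i finishes before any block of iteration i+1 starts. A minimal sketch of the idea (the kernel name, parameters, and launch configuration are placeholders, not your actual code):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: each block updates its own submatrix for
// iteration `iter`. The grid-wide join happens between launches.
__global__ void myFunc( float *mat, int iter )
{
    // ... per-block submatrix computation for this iteration ...
}

void runIterations( float *d_mat, int N )
{
    dim3 grid( 16, 16 );    // placeholder launch configuration
    dim3 block( 16, 16 );

    // Consecutive launches on the default stream are serialized by
    // the driver, so no explicit inter-block sync is needed here.
    for( int i = 0; i < N; i++ )
        myFunc<<< grid, block >>>( d_mat, i );

    cudaDeviceSynchronize();  // wait for the last iteration before reading results
}
```

Launch overhead is on the order of microseconds per call, so unless each iteration does very little work, the relaunch cost is usually negligible compared to the kernel time.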

Any help and pointers would be greatly appreciated.

You may use a global counter to count how many blocks have finished a certain task.
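A hedged sketch of that counter idea, using atomicInc (the kernel name and the "last block cleans up" step are assumptions for illustration). One caution: making blocks spin-wait on such a counter can deadlock if not all blocks are resident on the GPU at once, so it is safest to let only the last block act on the count rather than have every block busy-wait:

```cpp
// Global counter of blocks that have finished the current pass.
__device__ unsigned int blocksDone = 0;

__global__ void step( float *mat )
{
    // ... this block computes its submatrix ...

    __syncthreads();  // all threads of this block are done

    // One thread per block bumps the counter.
    if( threadIdx.x == 0 && threadIdx.y == 0 )
    {
        unsigned int numBlocks = gridDim.x * gridDim.y;
        // atomicInc wraps the counter back to 0 after numBlocks
        // increments, so it is ready for the next pass.
        unsigned int old = atomicInc( &blocksDone, numBlocks - 1 );
        if( old == numBlocks - 1 )
        {
            // This was the last block to finish the pass;
            // do any single-block follow-up work here.
        }
    }
}
```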

What would be the benefit of using several blocks if one can only run after another?
If that work has to be serialized and you cannot do anything about it, just use a big for-loop inside a single block.
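The single-block version might look like this (a sketch with hypothetical names, assuming the matrix fits the work of one block). With only one block in flight, __syncthreads() really is the per-iteration join the original post asked for:

```cpp
// Launched as iterate<<< 1, block >>>( d_mat, N ): one block loops
// over all iterations, so __syncthreads() joins every thread.
__global__ void iterate( float *mat, int N )
{
    for( int i = 0; i < N; i++ )
    {
        // ... each thread updates its part of the matrix ...

        __syncthreads();  // wait for all threads before the next iteration
    }
}
```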