Hi -
I just started programming in CUDA and was wondering whether there is a way to synchronize between thread blocks. From my understanding (and from stepping through emulation mode), __syncthreads() only synchronizes threads within a block.
I am basically trying to implement a function that iterates over a matrix. Each block handles a submatrix. The catch is each block has to finish the current computation before any of the other blocks can go on to the next iteration. Something like a join():
Code:
for( int i=0;i<N;i++ )
{
// 1. each block does its computation
// 2. wait for each block to finish, ie: join()
// (tried a __syncthreads() here but only syncs threads inside the block)
}
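To make the structure concrete, here is a minimal sketch of what I mean (the array name and the `+= 1.0f` update are just stand-ins, not my real computation):

```cuda
__global__ void iterate(float *d_mat, int n, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < N; i++) {
        if (idx < n)
            d_mat[idx] += 1.0f;   // stand-in for the real submatrix update

        __syncthreads();          // barrier for threads in THIS block only;
                                  // other blocks are not held back here
    }
}
```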
I’m pretty sure the computation part is correct, since if I use just a single block everything comes out as it should. When I use multiple blocks, things break.
One way I thought of doing it was to put the iteration loop on the host and have the kernel take the iteration number as an argument. Something like:
Code:
for( int i=0;i<N;i++ )
{
myFunc<<< grid, block >>>( i );
}
But won't this be slower, with all the kernel-launch and host/device overhead?
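In case it helps, here is roughly what the host-driven version would look like (the sizes, names, and the per-iteration update are placeholders, not my actual code):

```cuda
#include <cuda_runtime.h>

__global__ void myFunc(float *d_mat, int n, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        d_mat[idx] += (float)i;   // stand-in for the iteration-i update
}

int main(void)
{
    const int n = 1024, N = 16;
    float *d_mat;
    cudaMalloc((void **)&d_mat, n * sizeof(float));
    cudaMemset(d_mat, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Kernel launches issued to the same stream execute in order,
    // so all blocks of iteration i finish before iteration i+1 starts.
    for (int i = 0; i < N; i++)
        myFunc<<<grid, block>>>(d_mat, n, i);

    cudaThreadSynchronize();   // wait for the final iteration to complete
    cudaFree(d_mat);
    return 0;
}
```

My worry is just whether launching the kernel N times costs too much compared to one launch with an in-kernel loop.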
Any help and pointers would be greatly appreciated.