Global thread barrier

peastman, andradx

I think I have found a good and fast solution to your iteration problem.

__device__ unsigned int blocknum = 0;
__device__ unsigned int completedblockquantity = 0;

at the start of the block

__shared__ unsigned int current_block_num;

__shared__ unsigned int currentiteration, blockindex;

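if (threadIdx.x == 0)
{
    // take the next block number
    current_block_num = atomicAdd(&blocknum, 1);

    currentiteration = current_block_num / blocks_per_iteration;
    blockindex = current_block_num % blocks_per_iteration;

    // wait until all blocks of the earlier iterations have completed
    // (the counter is read as volatile so the loop sees updates from other blocks)
    while (*(volatile unsigned int*)&completedblockquantity < currentiteration * blocks_per_iteration)
        ;
}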

__syncthreads();

data = datarray[blockindex * blocksize + threadIdx.x];

and so on

at the end of the block

__threadfence();
__syncthreads();

if (threadIdx.x == 0)
    atomicAdd(&completedblockquantity, 1);   // one increment per block

At kernel launch, the grid must contain blocks_per_iteration * (iteration quantity) blocks, i.e. enough blocks to cover the data once for every iteration.
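
Putting the pieces above together, here is a minimal sketch of how the whole thing might look as a compilable program. The counter names are the ones from this post; everything else is my own assumption for illustration: the kernel name iterate, the helper run_demo, the two ping-pong buffers, and the toy workload (each element becomes the sum of itself and its right-hand neighbour from the previous iteration). I have not tuned it, so treat it as a starting point rather than a finished implementation.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Global counters (names from the post). They must be zeroed before every launch.
__device__ unsigned int blocknum = 0;               // next block number to hand out
__device__ unsigned int completedblockquantity = 0; // number of finished blocks

// Hypothetical workload: dst[i] = src[i] + src[(i + 1) % n], once per iteration.
__global__ void iterate(float* buf0, float* buf1, int n,
                        unsigned int blocks_per_iteration)
{
    __shared__ unsigned int currentiteration, blockindex;

    if (threadIdx.x == 0) {
        // Numbers are handed out in the order blocks actually start, so every
        // block we wait for below is already running or has finished.
        unsigned int current_block_num = atomicAdd(&blocknum, 1u);
        currentiteration = current_block_num / blocks_per_iteration;
        blockindex = current_block_num % blocks_per_iteration;

        // Spin until all blocks of the earlier iterations have completed.
        volatile unsigned int* done = &completedblockquantity;
        while (*done < currentiteration * blocks_per_iteration) { }
        __threadfence();
    }
    __syncthreads();

    // Even iterations read buf0 and write buf1, odd iterations the reverse.
    // Reading through volatile is a conservative choice for data written by other blocks.
    const volatile float* src = (currentiteration % 2 == 0) ? buf0 : buf1;
    float* dst = (currentiteration % 2 == 0) ? buf1 : buf0;

    int i = blockindex * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i] + src[(i + 1) % n];

    // Make this block's writes visible, then report completion once per block.
    __threadfence();
    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(&completedblockquantity, 1u);
}

// Hypothetical host helper: runs the kernel and compares against a CPU reference.
bool run_demo(int n, int blocksize, int iterations)
{
    const unsigned int blocks_per_iteration = (n + blocksize - 1) / blocksize;

    std::vector<float> host(n), ref(n), tmp(n);
    for (int i = 0; i < n; ++i) host[i] = ref[i] = float(i % 7);

    // CPU reference: the same update applied one full iteration at a time.
    for (int k = 0; k < iterations; ++k) {
        for (int i = 0; i < n; ++i) tmp[i] = ref[i] + ref[(i + 1) % n];
        ref.swap(tmp);
    }

    float *buf0 = nullptr, *buf1 = nullptr;
    cudaMalloc(&buf0, n * sizeof(float));
    cudaMalloc(&buf1, n * sizeof(float));
    cudaMemcpy(buf0, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Reset both counters before the launch.
    const unsigned int zero = 0;
    cudaMemcpyToSymbol(blocknum, &zero, sizeof(zero));
    cudaMemcpyToSymbol(completedblockquantity, &zero, sizeof(zero));

    // One launch covers all iterations: blocks_per_iteration blocks for each one.
    iterate<<<blocks_per_iteration * iterations, blocksize>>>(
        buf0, buf1, n, blocks_per_iteration);

    // The final result is in buf1 after an odd number of iterations, else buf0.
    float* result = (iterations % 2 == 0) ? buf0 : buf1;
    bool ok = cudaMemcpy(host.data(), result, n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < n; ++i) ok = ok && (host[i] == ref[i]);

    cudaFree(buf0);
    cudaFree(buf1);
    return ok;
}
```

For example, run_demo(4096, 128, 5) launches 32 * 5 = 160 blocks in a single kernel call. Only thread 0 of each block spins, and it only ever waits for blocks with smaller numbers, which have already started, so as far as I can tell the wait cannot deadlock even when the grid is larger than what fits on the GPU at once.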