peastman, andradx
I think I have found a good, fast solution to your iteration problem.
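__device__ unsigned int blocknum = 0;               // next block ticket to hand out
__device__ unsigned int completedblockquantity = 0; // number of blocks that have finished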
At the start of each block:
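__shared__ unsigned int current_block_num;
__shared__ unsigned int currentiteration, blockindex;
if (threadIdx.x == 0)
{
    // take a ticket; tickets are handed out in the order blocks actually start
    current_block_num = atomicAdd(&blocknum, 1);
    currentiteration = current_block_num / blocks_per_iteration;
    blockindex = current_block_num % blocks_per_iteration;
    // wait until every block of the earlier iterations has completed
    // (volatile read, so the counter is re-read from memory on every pass)
    while (*(volatile unsigned int *)&completedblockquantity < currentiteration * blocks_per_iteration)
        ;
}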
__syncthreads();
data = datarray[blockindex*blocksize + threadIdx.x];
// ... and so on
At the end of each block:
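__threadfence();   // make this thread's writes visible to other blocks
__syncthreads();   // wait until all threads of the block have finished
if (threadIdx.x == 0)
    atomicAdd(&completedblockquantity, 1);   // count the block once, not once per thread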
At kernel launch you need to launch enough blocks for (data size)*(iteration quantity), i.e. blocks_per_iteration blocks for every iteration.
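Putting the pieces together, here is a minimal sketch of how this could look as a complete program. The counter names are the ones from above; everything else is made up for illustration: the kernel name iterate, the host helper run_iterations, the two ping-pong buffers, and the per-iteration work itself (a cyclic shift plus one). I am also assuming that a __threadfence() after the wait plus volatile reads of the source buffer are enough to see the previous iteration's writes, which is the conservative choice. Because each block's position comes from the atomic ticket and not from blockIdx, a waiting block only ever waits on blocks that have already started, so it should not depend on the order in which the hardware schedules blocks.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Global counters (names from the post). They must be reset before every launch.
__device__ unsigned int blocknum = 0;
__device__ unsigned int completedblockquantity = 0;

// Hypothetical example kernel: each iteration computes dst[j] = src[(j+1) % n] + 1,
// so every iteration depends on results written by other blocks in the previous one.
__global__ void iterate(int *buf0, int *buf1, int n, unsigned int blocks_per_iteration)
{
    __shared__ unsigned int currentiteration, blockindex;

    if (threadIdx.x == 0) {
        // Take a ticket in the order blocks actually start running.
        unsigned int current_block_num = atomicAdd(&blocknum, 1u);
        currentiteration = current_block_num / blocks_per_iteration;
        blockindex = current_block_num % blocks_per_iteration;

        // Spin until all blocks of the earlier iterations have completed.
        volatile unsigned int *done = &completedblockquantity;
        while (*done < currentiteration * blocks_per_iteration) { }
        __threadfence();  // order the data reads below after the counter read
    }
    __syncthreads();

    // Ping-pong: even iterations read buf0 and write buf1, odd ones the reverse.
    const volatile int *src = (currentiteration % 2 == 0) ? buf0 : buf1;
    int *dst = (currentiteration % 2 == 0) ? buf1 : buf0;

    int j = (int)(blockindex * blockDim.x + threadIdx.x);
    if (j < n)
        dst[j] = src[(j + 1) % n] + 1;

    __threadfence();   // publish this thread's writes to other blocks
    __syncthreads();   // all threads of the block are done
    if (threadIdx.x == 0)
        atomicAdd(&completedblockquantity, 1u);  // count the block exactly once
}

// Hypothetical host helper: runs `iterations` (>= 1) dependent passes in one launch.
// Error checking is omitted to keep the sketch short.
std::vector<int> run_iterations(int n, int iterations, int threads_per_block)
{
    std::vector<int> host(n);
    for (int j = 0; j < n; ++j) host[j] = j;

    int *buf0 = nullptr, *buf1 = nullptr;
    cudaMalloc(&buf0, n * sizeof(int));
    cudaMalloc(&buf1, n * sizeof(int));
    cudaMemcpy(buf0, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Reset both counters before the launch.
    unsigned int zero = 0;
    cudaMemcpyToSymbol(blocknum, &zero, sizeof(zero));
    cudaMemcpyToSymbol(completedblockquantity, &zero, sizeof(zero));

    unsigned int blocks_per_iteration = (n + threads_per_block - 1) / threads_per_block;
    // One block per (data chunk, iteration) pair.
    iterate<<<blocks_per_iteration * iterations, threads_per_block>>>(
        buf0, buf1, n, blocks_per_iteration);

    // The final result is in buf0 after an even number of iterations, else in buf1.
    cudaMemcpy(host.data(), (iterations % 2 == 0) ? buf0 : buf1,
               n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf0);
    cudaFree(buf1);
    return host;
}
```

For example, run_iterations(1000, 5, 128) launches 8*5 = 40 blocks, and with this toy update element j of the result should equal (j + 5) % 1000 + 5. Two buffers are used so that an iteration never overwrites data that blocks of the same iteration are still reading.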