Hierarchical blocks/Thread processes

Hello all,
I’m getting started with CUDA and am coding a multi-pass GPU algorithm:



…tc); // 1 thread for each d_datatoprocess element

The problem is that for a high-resolution grid (nbthread > 512²), I get a big overhead from the reduction process…

So I would like to process hierarchically, storing a convergence flag per block instead of per thread. That improves the performance of the reduction.
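For illustration, here is a minimal sketch of what I mean by one flag per block (all names, including `doOnePass`, are my own placeholders, not actual code): each block ORs its threads’ “not yet converged” results into a single shared flag, so the later reduction only has to scan nbblocks values instead of nbthread values.

```cuda
// Sketch: per-block convergence flag via shared memory (illustrative names).
__global__ void processAndFlag(const float *d_datatoprocess,
                               int *d_blockconvergenceflags)
{
    __shared__ int s_notConverged;          // one flag for the whole block
    if (threadIdx.x == 0) s_notConverged = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    bool converged = doOnePass(d_datatoprocess, tid); // hypothetical per-element work
    if (!converged) atomicOr(&s_notConverged, 1);     // any thread can veto convergence
    __syncthreads();

    if (threadIdx.x == 0)                   // one write per block, not per thread
        d_blockconvergenceflags[blockIdx.x] = s_notConverged;
}
```

The reduction over `d_blockconvergenceflags` then touches only nbblocks entries.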

But my problem is that I would like the CPU to filter out the blocks that have converged, so that no GPU threads are spent on already-converged blocks.
The only way I see to do that is:

```
while (GPUreduction(blockconvergenceflags, nbblocks) != 0) { // convergence test

    ActiveBlocksIndices = CPUFilterConvergedBlocks(convergence); // gather non-converged blocks in a table

    for (int i = 0; i < ActiveBlocksIndices.size; i++)
        // fill d_toprocessthistime with ActiveBlocksIndices[i], repeated

    Kernel<<< ... >>>(
        NBTHREADPERBLOCKS * ActiveBlocksIndices.size, // nbthreads
        d_toprocessthistime, // redundant block-id table -> NBTHREADPERBLOCKS redundancy per block
        convergenceflags,
        ...);

    ...(..., convergenceflags); // per-block reduction
}
```


The problem is that I want to treat all the threads of a block in one kernel call.
So for each active block I need to copy its index once for each of its NBTHREADPERBLOCKS constituent threads, which introduces redundant information… :(
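To show the redundancy concretely, here is a sketch of the per-thread lookup in this scheme (names are illustrative, not my actual code): every thread of an active block reads the same logical block id from the per-thread table, so `d_toprocessthistime` has to repeat each index NBTHREADPERBLOCKS times.

```cuda
// Sketch (illustrative names) of the redundant indexing scheme.
__global__ void processActive(const int *d_toprocessthistime,
                              float *d_datatoprocess)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int logicalBlock = d_toprocessthistime[tid]; // same value read by all
                                                 // NBTHREADPERBLOCKS threads of the block
    int elmt = logicalBlock * blockDim.x + threadIdx.x; // element this thread owns
    // ... process d_datatoprocess[elmt] ...
}
```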
Is there another way to do this kind of hierarchical processing?
I hope I have been clear enough :/