Hierarchical blocks/Thread processes

Hello all,
I’m getting started with CUDA and am coding a multi-pass GPU algorithm:



…tc); // 1 thread for each d_datatoprocess element

The problem is that for a high-resolution grid (nbthread > 512²), I get a big overhead from the reduction process…

So I would like to process hierarchically, storing a convergence flag per block instead of per thread. That improves the performance of the reduction.
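For illustration, here is a minimal sketch of what I mean by one flag per block (all names, including `doOnePass`, are my own placeholders, not actual code): each block ORs its threads’ “not yet converged” results into a single shared flag, so the later reduction only has to scan nbblocks values instead of nbthread values.

```cuda
// Sketch: per-block convergence flag via shared memory (illustrative names).
__global__ void processAndFlag(const float *d_datatoprocess,
                               int *d_blockconvergenceflags)
{
    __shared__ int s_notConverged;          // one flag for the whole block
    if (threadIdx.x == 0) s_notConverged = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    bool converged = doOnePass(d_datatoprocess, tid); // hypothetical per-element work
    if (!converged) atomicOr(&s_notConverged, 1);     // any thread can veto convergence
    __syncthreads();

    if (threadIdx.x == 0)                   // one write per block, not per thread
        d_blockconvergenceflags[blockIdx.x] = s_notConverged;
}
```

The reduction over `d_blockconvergenceflags` then touches only nbblocks entries.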

But my problem is that I would like the CPU to filter out the blocks that have converged, so that no GPU threads are spent on already-converged blocks.
The only way I see to do that is:

```
while (GPUreduction(blockconvergenceflags, nbblocks) != 0) { // convergence test

    ActiveBlocksIndices = CPUFilterConvergedBlocks(convergence); // gather non-converged blocks in a table

    for (int i = 0; i < ActiveBlocksIndices.size; i++)
        // fill d_toprocessthistime with ActiveBlocksIndices[i], repeated

    Kernel<<< ... >>>(
        NBTHREADPERBLOCKS * ActiveBlocksIndices.size, // nbthreads
        d_toprocessthistime, // redundant block-id table -> NBTHREADPERBLOCKS redundancy per block
        convergenceflags,
        ...);

    ...(..., convergenceflags); // per-block reduction
}
```


The problem is that I want to treat all the threads of a block in one kernel call.
So for each active block I need to copy its index once for each of its NBTHREADPERBLOCKS constituent threads, which introduces redundant information… :(
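To show the redundancy concretely, here is a sketch of the per-thread lookup in this scheme (names are illustrative, not my actual code): every thread of an active block reads the same logical block id from the per-thread table, so `d_toprocessthistime` has to repeat each index NBTHREADPERBLOCKS times.

```cuda
// Sketch (illustrative names) of the redundant indexing scheme.
__global__ void processActive(const int *d_toprocessthistime,
                              float *d_datatoprocess)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int logicalBlock = d_toprocessthistime[tid]; // same value read by all
                                                 // NBTHREADPERBLOCKS threads of the block
    int elmt = logicalBlock * blockDim.x + threadIdx.x; // element this thread owns
    // ... process d_datatoprocess[elmt] ...
}
```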
Is there another way to do this kind of hierarchical processing?
I hope I have been clear enough :/