I’m getting started with CUDA and am coding a multi-pass GPU algorithm:
tc); // 1 thread for each d_datatoprocess elmt
The problem is that for high-resolution grids (nbthread > 512²), I have a big overhead due to the reduction process…
So I would like to proceed hierarchically, storing a convergence flag per block instead of per thread. This improves the performance of the reduction, since there is only one flag per block to reduce.
But my problem is that I would like to filter out, on the CPU, the blocks that have converged, so that no GPU threads are spent on already converged blocks.
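To make the per-block flag concrete, here is a minimal sketch of what I have in mind. The kernel name `blockConvergence`, the residual array, and the tolerance test are hypothetical placeholders rather than my real code, and it assumes a device of compute capability 2.0 or later so that `__syncthreads_and` is available.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one convergence flag per block instead of one per thread.
__global__ void blockConvergence(const float* d_residual, int n, float tol,
                                 int* d_blockConverged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads count as converged so they never hold the block back.
    int threadConverged = (i >= n) || (fabsf(d_residual[i]) < tol);
    // Non-zero only if the predicate is non-zero for every thread of the block.
    int allConverged = __syncthreads_and(threadConverged);
    if (threadIdx.x == 0)
        d_blockConverged[blockIdx.x] = (allConverged != 0);
}

int main()
{
    const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    // Assumed test data: the first half is converged, the second half is not.
    std::vector<float> h_res(n);
    for (int i = 0; i < n; ++i) h_res[i] = (i < n / 2) ? 0.0f : 1.0f;

    float* d_res = nullptr;
    int* d_flags = nullptr;
    cudaMalloc(&d_res, n * sizeof(float));
    cudaMalloc(&d_flags, blocks * sizeof(int));
    cudaMemcpy(d_res, h_res.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockConvergence<<<blocks, threads>>>(d_res, n, 1e-3f, d_flags);

    std::vector<int> h_flags(blocks);
    cudaMemcpy(h_flags.data(), d_flags, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b) printf("block %d converged: %d\n", b, h_flags[b]);

    cudaFree(d_res);
    cudaFree(d_flags);
    return 0;
}
```

With this layout the host only has to read back one integer per block, not one per thread.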
The only way I see to do that is:
ActiveBlocksIndices=CPUFilterConvergedBlocks(convergence); //gather non converged blocks in a tab
d_toprocessthistime, //redundant blockid tab -> NBTHREADPERBLOCKS redundancy per block
s,convergenceflags );//per block reduction
The problem is that I want to process all the threads of a block in a single kernel call.
So for each active block I need to copy its index once for each of its NBTHREADPERBLOCKS threads, which introduces redundant information… :(
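One direction that might avoid the duplication is to store a single index per active block and let every thread look it up through `blockIdx.x`. The sketch below is only an illustration under that assumption: `compactActiveBlocks`, `processActiveBlocks`, and the `+= 1.0f` update are hypothetical stand-ins for my real filter and kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Host-side filter: keep the indices of blocks whose flag is 0 (not converged).
std::vector<int> compactActiveBlocks(const std::vector<int>& blockConverged)
{
    std::vector<int> active;
    for (int b = 0; b < (int)blockConverged.size(); ++b)
        if (!blockConverged[b]) active.push_back(b);
    return active;
}

// One launched block per active logical block; the table holds ONE entry per block.
__global__ void processActiveBlocks(float* d_data, int n, const int* d_activeBlocks)
{
    int logicalBlock = d_activeBlocks[blockIdx.x];   // same value for the whole block
    int i = logicalBlock * blockDim.x + threadIdx.x; // element owned by this thread
    if (i < n) d_data[i] += 1.0f;                    // placeholder for the real update
}

int main()
{
    const int n = 1024, threads = 256;
    std::vector<int> flags = {1, 0, 1, 0};           // assumed per-block flags
    std::vector<int> active = compactActiveBlocks(flags);

    float* d_data = nullptr;
    int* d_active = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    if (!active.empty()) {
        cudaMalloc(&d_active, active.size() * sizeof(int));
        cudaMemcpy(d_active, active.data(), active.size() * sizeof(int),
                   cudaMemcpyHostToDevice);
        // The grid size is the number of active blocks, not the full grid.
        processActiveBlocks<<<(int)active.size(), threads>>>(d_data, n, d_active);
    }

    std::vector<float> h(n);
    cudaMemcpy(h.data(), d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("element 0: %g, element 256: %g\n", h[0], h[256]);

    cudaFree(d_data);
    cudaFree(d_active);
    return 0;
}
```

The index table then has as many entries as there are active blocks, instead of NBTHREADPERBLOCKS copies of each entry, and converged blocks are never launched at all.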
Is there another way to do this kind of hierarchical processing?
I hope I have been clear enough :/