I was reading about the 7 stages of optimization of Parallel Reduction.
In the first stage, called Interleaved Addressing, the stated problem is "highly divergent warps", and it also says the "% operator is very slow".
My question is: is the addressing strategy itself highly divergent, or is it the use of the % operator that is inefficient?
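
To make the two possible causes easier to separate, here is a minimal host-side sketch in plain C++ (not a CUDA kernel) that counts how many warps would diverge under each branch condition. It assumes a warp size of 32 and a block size that is a multiple of 32. The two conditions are how I understand stage 1 (`tid % (2*s) == 0`) and the strided-index variant of stage 2 (`2*s*tid < blockSize`). The function names are hypothetical and only for illustration.

```cpp
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 32;  // assumed warp width

// Counts warps in which some, but not all, threads take the branch.
// Assumes active.size() is a multiple of kWarpSize.
int divergentWarps(const std::vector<bool>& active) {
    int count = 0;
    for (std::size_t w = 0; w < active.size(); w += kWarpSize) {
        int on = 0;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (active[w + lane]) ++on;
        }
        if (on != 0 && on != kWarpSize) ++count;
    }
    return count;
}

// Stage 1 style condition: a thread works only if tid % (2*s) == 0.
int divergentWarpsModulo(int blockSize, int s) {
    std::vector<bool> active(blockSize);
    for (int tid = 0; tid < blockSize; ++tid) {
        active[tid] = (tid % (2 * s) == 0);
    }
    return divergentWarps(active);
}

// Stage 2 style condition: index = 2*s*tid, thread works if index < blockSize.
int divergentWarpsStrided(int blockSize, int s) {
    std::vector<bool> active(blockSize);
    for (int tid = 0; tid < blockSize; ++tid) {
        active[tid] = (2 * s * tid < blockSize);
    }
    return divergentWarps(active);
}
```

For a block of 128 threads and `s = 1`, `divergentWarpsModulo(128, 1)` returns 4 (every warp is half active), while `divergentWarpsStrided(128, 1)` returns 0 (the active threads are packed into the first two warps). This only models divergence; it says nothing about the cost of the `%` instruction itself, which would be a separate issue.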