Is modulo very expensive?

I was reading about the 7 stages of optimization of Parallel Reduction.
For the first stage, called Interleaved Addressing, the stated problem is "highly divergent warps", and it also says the "% operator is very slow".

My question is: was that strategy slow because the warps are highly divergent, or because the % operator is itself inefficient?


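Both were probably problems. Highly divergent warps are bad because threads in the same warp that take different branches get serialized, and the modulo operator is really slow on GPUs because integer modulo compiles to a sequence of many instructions rather than a single one.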
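To make the two problems concrete, here is a minimal sketch of both variants. The kernel names `reduceModulo` and `reduceStrided`, the block size, and the test data are my own choices for illustration, not taken from the slides, and both kernels assume the block size is a power of two. The first kernel picks the active threads with `tid % (2 * s) == 0`, so the active threads are scattered across every warp and each one pays for a modulo. The second computes a strided index instead, so the active threads are contiguous and no modulo is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Interleaved addressing with a modulo test: active threads are spread
// across all warps (divergence) and every thread evaluates a '%'.
__global__ void reduceModulo(const int *in, int *out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Same additions, but thread 'tid' works on element 2*s*tid, so the
// active threads are the lowest-numbered ones and there is no '%'.
__global__ void reduceStrided(const int *in, int *out) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main() {
    const int threads = 256;  // assumed power of two, required by both kernels
    const int blocks = 4;
    const int n = threads * blocks;

    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    // Each block writes one partial sum; the host adds them up.
    reduceModulo<<<blocks, threads, threads * sizeof(int)>>>(in, out);
    cudaDeviceSynchronize();
    int sumModulo = 0;
    for (int b = 0; b < blocks; ++b) sumModulo += out[b];

    reduceStrided<<<blocks, threads, threads * sizeof(int)>>>(in, out);
    cudaDeviceSynchronize();
    int sumStrided = 0;
    for (int b = 0; b < blocks; ++b) sumStrided += out[b];

    printf("modulo: %d, strided: %d, n: %d\n", sumModulo, sumStrided, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Timing the two kernels separately with a profiler on your own card is the only way to see how much each factor costs there. Note that the change removes the divergence and the modulo at the same time, so this comparison alone cannot attribute the speedup to one or the other. If the divisor is a power of two, `tid % n` can also be written as `tid & (n - 1)`, which would let you remove the modulo while keeping the divergent branch.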