Why is modulus so slow?

From all that I have read, I know that modulus in GPGPU code is slow, but why? I assume that it has to do with floating points not liking modulus, but I am kind of interested now.

Integer division and modulo are relatively slow because there is no direct hardware support for them (each one compiles to a multi-instruction sequence). Floating point modulo is fast.

Integer modulo is also slow on CPUs for the same reason.

You can get an improvement by replacing the modulo op with the actual formula.

a%b == a - (b*(int)(a/b))

Why would that be faster?

That’s likely how it’s already implemented in the microcode… actual operator timings show divide at 10 clocks, mul at 4 clocks, add at 1 clock, and mod at 17 clocks… which adds up reasonably closely (10 + 4 + 1 = 15).

You could use that benchmark program to test the explicit version yourself. It’s likely identical in speed to the builtin %.
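If you want to convince yourself that the explicit version computes the same thing before timing it, here is a minimal sketch in plain C. The names explicit_mod and count_mismatches are made up for illustration, and this only checks correctness on a small range; it says nothing about speed on any particular GPU.

```c
/* Hypothetical helper: modulo spelled out as divide, multiply, subtract.
   Integer division in C99 and later truncates toward zero, which is
   exactly what makes this match the builtin % operator. */
static int explicit_mod(int a, int b) {
    return a - b * (a / b);
}

/* Hypothetical helper: counts inputs where the explicit formula and the
   builtin % disagree, over a small range of dividends and divisors. */
static int count_mismatches(void) {
    int mismatches = 0;
    for (int a = -50; a <= 50; ++a) {
        for (int b = 1; b <= 16; ++b) {
            if (explicit_mod(a, b) != a % b) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Calling count_mismatches() should return 0, since C defines a % b so that (a/b)*b + a%b equals a.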

Depending on what your divisor is, there are also bit-manipulation tricks to do modulus division that might be significantly faster.

Use this equivalence:

a % B == a & (B - 1)

To be honest, I'm not sure whether that applies only when B is a power of 2 or whether it is a general rule, but give it a try; it might help you.

If you're not sure, you can just check for yourself:

1%3 = 1

1&2 = 0

Fine, B has to be a power of 2 for the trick to work.
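For what it's worth, a rough sketch of the mask trick in plain C, with a fallback to the builtin % when the divisor is not a power of 2. The names is_pow2 and fast_mod are invented for this example, and it assumes unsigned operands, since the mask does not match % for negative values.

```c
#include <stdbool.h>

/* Hypothetical helper: true when b is a power of two,
   i.e. nonzero with exactly one bit set. */
static bool is_pow2(unsigned b) {
    return b != 0 && (b & (b - 1)) == 0;
}

/* Hypothetical helper: a % b via a bit mask when b is a power of two,
   otherwise the ordinary % operator. b must be nonzero. */
static unsigned fast_mod(unsigned a, unsigned b) {
    return is_pow2(b) ? (a & (b - 1)) : (a % b);
}
```

When the divisor is a compile-time constant power of 2, the compiler will usually apply this mask on its own for unsigned operands, so writing it by hand mainly matters when the divisor is only known at run time.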