According to the NVIDIA CUDA Programming Guide, integer modulo is very slow. Are there any alternatives? I can't assume the divisor will always be a power of 2, so bit masking is not an option, but I don't need full 32-bit precision, so floating point might be an option if it's faster. I also saw someone in another thread claim that 64-bit integer modulo is fast, but I'm not sure how reliable that claim is.

I find it hard to believe that 64-bit integer modulo is supposed to be fast. The (approximate) floating-point reciprocal, however, is indeed computed by the special function unit in 16 (compute capability 1.x), 8 (c.c. 2.0), or 4 (c.c. 2.1) cycles per warp. On top of that you need 4 (1.x) or 1 (2.x) cycles each for the int->float conversion, the multiplication by the reciprocal, and the float->int conversion. Some of that cost can be overlapped if you have more than one division per thread.

We have discussed this issue before; please check http://forums.nvidia.com/index.php?showtopic=106232&pid=589954&start=&st=#entry589954

The fixed-point modulo suggested by @Sylvain Collange is a good alternative.

In my experiments it is about 2x faster than the traditional approach, which implements modulo in double precision.