Is modulus only slow in CUDA? It seems to be quick for texture wrapping

The 8800 hardware supports texture wrapping, which I presume is implemented using an integer or floating-point modulus.

If the hardware supports such a thing, can someone please help me understand why this operation is so slow in CUDA? Can we expect this to improve with future releases?

G80 has special functional units for texturing, which presumably do 100% of the clamping, wrapping, bi/tri-linear interpolation and filtering. Most shaders therefore have little need for modulo, so it was never made fast. I suspect modulo performance on G80 will stay the same, except for special cases such as power-of-two divisors, which the compiler can implement with bitwise masking.


Good point. Most textures are powers of two to begin with, so (coordinate % width) can become something like (coordinate & (width-1)).
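To make the masking trick concrete, here is a minimal sketch in plain C (the same expressions are valid inside a CUDA kernel). The names wrap_mod and wrap_mask are made up for illustration and are not part of any API. The sketch assumes unsigned coordinates; with signed negative values, % and & give different results.

```c
#include <assert.h>

/* General wrap: correct for any width > 0, but needs an integer modulo. */
static unsigned wrap_mod(unsigned coord, unsigned width)
{
    assert(width != 0);
    return coord % width;
}

/* Power-of-two wrap: the low bits of coord are already the remainder,
 * so a single AND replaces the modulo. Only valid when width is 2^k. */
static unsigned wrap_mask(unsigned coord, unsigned width)
{
    assert(width != 0 && (width & (width - 1)) == 0); /* power-of-two check */
    return coord & (width - 1);
}
```

For a width of 256, wrap_mask(300, 256) and wrap_mod(300, 256) both give 44. For a width of 100 only wrap_mod applies, since the mask form would trip the power-of-two assertion.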

I would be curious to see how texture wrapping performs on non-power-of-two textures.

Pow2 texture requirements are a thing of the past – we’ve supported general non-pow-2 textures since GeForce 6800. The texture units handle addressing, including coordinate wrapping.
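For anyone wondering what wrapping means when the width is not a power of two, here is a rough sketch of the arithmetic in plain C, assuming normalized coordinates where 1.0 spans the whole texture: keep the fractional part of the coordinate, then scale by the width. This only illustrates the idea and does not claim to be how the texture units implement it; wrap_texel is a hypothetical name, and the cast-based floor assumes the coordinate fits in a long.

```c
#include <assert.h>

/* Hypothetical helper: map a normalized coordinate u to a texel
 * index in [0, width), wrapping for any width, power of two or not. */
static int wrap_texel(float u, int width)
{
    assert(width > 0);
    float whole = (float)(long)u;      /* cast truncates toward zero */
    if (whole > u) whole -= 1.0f;      /* turn truncation into floor for negative u */
    float frac = u - whole;            /* fractional part, nominally in [0, 1) */
    int i = (int)(frac * (float)width);
    return i < width ? i : width - 1;  /* guard in case rounding reaches width */
}
```

With a width of 100, a coordinate of 1.5 lands on texel 50 and a coordinate of -0.25 lands on texel 75, with no integer modulo involved.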