I have a Quadro FX 1700 with Toolkit 2.1.1635 and SDK 2.10, running on Ubuntu 8.04 64-bit,
for which I wrote an erosion filter. This filter uses one thread per vertical line and works just fine. But it produces incorrect results if I introduce a circular buffer with modulo. In my test case (white image, black border) it results in black dots in the output image, where the whole image should be completely white.
Apart from that, modulo is an expensive operation and is best avoided. I don’t know if the compiler is smart enough to eliminate the modulo if you insert a #pragma unroll 3 before the loop.
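For a circular buffer whose index only ever steps forward by one, a compare-and-reset yields the same index sequence as the modulo. Here is a minimal sketch, assuming a three-slot buffer and an index that stays in range; `next_slot` and the `HD` macro are names made up for illustration and are not taken from the original filter:

```cpp
// Let the helper build both as plain C++ and as CUDA device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: advance a circular-buffer slot index without '%'.
// Only valid while 0 <= slot < size on entry.
HD inline int next_slot(int slot, int size)
{
    ++slot;                            // step forward by one
    return (slot == size) ? 0 : slot;  // wrap around at the end of the buffer
}
```

In a loop, an update along the lines of `row = (row + 1) % 3` would then become `row = next_slot(row, 3)`. This only covers single-step advances; an arbitrary offset still needs a real modulo or a power-of-two buffer size.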
32-bit integer modulo is not necessarily a very expensive operation. The compiler includes optimizations for 32-bit division and modulo with a constant divisor. Depending on the value of the constant divisor and the signedness of the operands, this results in anywhere from 1 to about 8 machine instructions, if memory serves. cuobjdump can be used to check what gets generated for a particular case.
On sm_2x platforms, i.e. Fermi-class GPUs, even 32-bit integer modulo with a variable divisor is not all that expensive: about 17 instructions of inline code, as I recall (again, cuobjdump will show exactly what is being generated for a particular case). In other words, on Fermi the relative cost of 32-bit modulo compared to, say, a 32-bit integer add is comparable to what one would encounter on a CPU. Therefore I would not recommend that programmers go out of their way to avoid 32-bit modulo if they don’t do this in their corresponding CPU code.
If it turns out that some code is dominated by the cost of modulo operations it’s of course worth thinking about alternatives, but that’s true for all platforms I am familiar with (including x86, PowerPC, SPARC, ARM), and not particular to GPUs.
[Later:]
I set up a small test app which I compiled with the CUDA 4.0 toolchain. I count 17 instructions for 32-bit unsigned integer modulo with variable divisor, 20 instructions for the 32-bit signed integer modulo with variable divisor.
Oh, I didn’t want to imply that Nvidia’s modulo implementation is inefficient in any way (it’s not). It’s just that with CUDA I’m always in full optimization mode (if we don’t care about speed, why run it on the GPU at all?), and when I see a handful of machine instructions that is easily avoided, I’d like to point it out:
Modulo 3 is replaced with several instructions (as njuffa wrote).
Modulo 4 is recognized by the compiler and changed to a single AND instruction.
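To make the power-of-two case concrete: for unsigned operands, `x % 4` and `x & 3` always agree, which is why a single AND suffices. A small sketch follows (`mod4_mask` and `HD` are illustrative names, not from any poster's code). For signed operands the two differ on negative values, so a plain mask is not a drop-in replacement there.

```cpp
// Let the helper build both as plain C++ and as CUDA device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: modulo by 4 written as a mask.
// Equivalent to x % 4u for every unsigned x, because 4 is a power of two.
HD inline unsigned int mod4_mask(unsigned int x)
{
    return x & 3u;  // keep the two low bits
}

// Signed caveat: in C++ the remainder takes the sign of the dividend,
// so -5 % 4 is -1, whereas -5 & 3 is 3 on two's-complement machines.
```

If the circular buffer size can be chosen freely, picking a power of two makes the masked form available for any offset, not just single steps.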
I noticed the same when I did a bitwise AND of the 9 inputs of the erosion filter: each bitwise AND on an unsigned char resulted in 3 operations. Therefore I’ve changed it into an add and one compare.
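One way to read the add-and-compare idea is sketched below. It assumes a strictly binary image in which every pixel is either 0 or 255; under that assumption the nine neighbours are all white exactly when their sum equals 9 * 255. The function names and the `HD` macro are hypothetical, and the actual filter may be organized differently.

```cpp
// Let the helpers build both as plain C++ and as CUDA device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Reference version: bitwise AND of the nine neighbourhood pixels.
HD inline unsigned char erode9_and(const unsigned char p[9])
{
    unsigned char r = 255;
    for (int i = 0; i < 9; ++i) r &= p[i];
    return r;
}

// Add-and-compare version. ASSUMPTION: each pixel is exactly 0 or 255.
// Accumulate in a 32-bit integer, then do a single compare at the end.
HD inline unsigned char erode9_sum(const unsigned char p[9])
{
    unsigned int sum = 0;
    for (int i = 0; i < 9; ++i) sum += p[i];
    return (sum == 9u * 255u) ? 255 : 0;
}
```

For grey-level input the two versions are not equivalent, so the sum form only applies to the binary case.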