integer division and modulo

The integer result of the division of two integers ignores any remainder and returns only the truncated quotient. Modulo returns the integer remainder of that division. According to the Programming Guide, both of these operations are relatively slow. However, I need both of these values in my kernel. Are there any functions that will return both of these values and are more efficient than computing both of these values independently?

Not that I know of, although this a good idea since modulo and integer are often used together. It may be that the JIT optimizer is smart enough to remove the redundant code from both calls.

If you know that your input values are exactly representable in floating point, it may be worth casting them and using the floating point modulo and divide instead.

Had to go back to look some of this stuff up…

I was unable to find the range of integers that are exactly representable as floating point numbers, but I assume it’s based off the length of the mantissa. Assuming the IEEE-754 floating point standard, there is a 24-bit mantissa (with the leading ‘1’ dropped off, so really represented as 23-bits). That would give an upper limit of 2^24. My integers are all positive, so I don’t need to be concerned with the negative case. Are all positive integers from 2^24 down to zero exactly representable?

This leaves me with two concerns, however. My first concern is that, according to NVIDIA documentation, floating-point division is not IEEE-754 compliant. However, if these two integers are represented exactly, this may not matter.

My second concern is this. What if the quotient of two integers isn’t able to be represented exactly using floating point. This could probably cause some errors when truncating the value. Hypothetically speaking, if the quotient should be 6.9999999, but is represented as 7 (since 6.9999999 is unrepresentable), then the truncation will yield 7 rather than 6.

The values in my numerator and denominator will be in the ranges [0, 2^24] and [2^13, 2^23], respectively. The denominator will always be a power of 2. The same is untrue for the numerator.

Thanks!

To avoid the inaccuracy of floorf for computed results like 6.9999999 and for numbers that are not exactly representable, you can always add some epsilon and cast the result to an integer. This is cheaper than doing an integer division:

int a = 42; int b = 6; int c = a/b; // 16 cycles
is equivalent to
float a = 42.0f; float b = 6.0f; int c = (int)( (a/b)+0.001f ); // fp rcp+mul or div (2-4 cycles I believe) and fp add (1 cycle)

I don’t know if casting is internally the same as floorf(), but it should. Guess I have to check the PTX.

A cheap modulo can be achieved based on the above with less cycles than integer %.

If the denominator is always a power of 2, then division and modulo is as easy as right shift and bitwise-and! You can take the log2 of a power-of-two integer using the function __ffs(x). So, a/b == a>>__ffs(b ). a%b == a&(b-1).

You avoid this by specifying “round-to-zero.” EDIT: Ugh, it seems like the only rounding mode for division, unlike for multiplication and addition, is round-to-nearest-even. Hmm.

I can’t find how long fmod() takes. Any confirmation that it’s quick? If so, you’d take the modulus, do subtraction, then do regular float division (or rather, use __fdividef(x,y)). That should give perfect precision, I think.

For getting back an integer, the Guide has this to say:

Has anyone ever benchmarked conversion speeds of int->float or float->int? That’d be very interesting to know. On x86, they can be surprisingly costly. There are bit-level tricks to do conversion of small floats to ints sometimes used in x86 with doubles to avoid it. Such tricks could also work on the GPU but it’d depend on the hardware’s native conversion speed…
so if anyone has measured it, it’d be fascinating to know if it’s a cheap 4 clock instruction (like float + - *) or something more expensive.
Lots of geeky details: http://www.stereopsis.com/sree/fpu2006.html

Also, if you’re doing integer divides or modulo with a CONSTANT number, there’s amazing tricks that can shift those into multiplies and cleanup code. That’s usually done by an optimizing compiler, so looking at the generated PTX code would tell you if nvcc is smart enough or not.
Geeky details in a whole chapter (50 pages!) of Hacker’s Delight. http://www.hackersdelight.org/

I’ve only leafed through Hacker’s Delight in a bookstore, but the bulk of their division/remainder optimizations are a regurgitation of “Division of Invariant Integers Using Multiplication” by Granlund and Montgomery. http://citeseerx.ist.psu.edu/viewdoc/downl…p1&type=pdf

NVCC and the “JIT optimizer” do not appear to implement this method (from testing), however to address the problem of the original poster if in fact he is dividing by a constant, he can precompute and alter his code. I can post a calculator for said values if so desired.

In CUDA a 64 bit division would be converted to

// q = quotent, a = dividend, c = constant computed from divisor

unsigned long long q, a, c;

q = __umul64hi(a, c);

To get the remainder, simply compute r = a - (q * d) where d is the divisor.

-Patrick

Sorry, double post

Yes, all variations of _float2int[rn,rz,ru,rd] and __int2float_xx map to a single G80 instruction which costs 4 cycles. Even though float mul and add only support rounding to the nearest or toward zero, conversions to/from int support all 4 IEEE rounding modes.

Even more interesting I think is that __int_as_float and __float_as_int cost 0 cycles…

Let’s not get elitist here. If Granlund&Montogmery wrote a dense, indecipherable (for a programmer) paper, and Hacker’s Delight explained the same thing and made it accessible, then it’s not a “regurgitation.” People should be referred to hacker’s delight and not the “original.” (This is a total rant, but I hate how in science people always have the poser gene in them and have to refer to the “old school” the “first” etc paper or research. And when someone does the invaluable job of recasting the same, then – their work is equated with chewing something, swallowing it, and vomitting it back out? It holds science back!)

Unfortunately with the way IEEE defined a float’s exponent (it’s not two’s complement, it’s a biased positive integer), reinterpreting an int (eg 10) as a float gets you a very tiny number (eg 1e-44). Moreover, you get a denormal, which I don’t think is even supported by cuda (so actually, you just get 0). (You have to bitwise-and the exponent into the int before reinterpreting, which is what someone was saying regarding x86.)

denormals are supported for doubles on the double precision unit on GT200 architecture.

My point was that on the GPU you can afford to reinterpret floats as ints to perform bit masking and arithmetic operations on it, since there is no penalty associated with typecasting.

This is not the case on most current CPUs. Reinterpreting a float as an int typically adds a few cycles of latency, or more on architectures with separate integer and FP register files (e.g. AMD K8 and K10). So it is often more efficient on those to perform a few more FP operations than work directly on the mantissa and exponent.

Whereas on the GPU you are free to perform any dirty hack on FP numbers :)

Well, I personally have a problem with both. One treats it as a complicated mathematical problem while the other one treats it as “magic”.

The idea is trivial: It is just multiplication by the inverse, and the inverse is represented as 32 bit fixed-point number, so e.g. instead of a / 12 you write a * ((1 / 12) * 2^35) / 2 ^ 35.

Say, that there makes sense