integer division and modulo

bdg146psu · November 6, 2008, 4:02pm

The integer result of the division of two integers ignores any remainder and returns only the truncated quotient. Modulo returns the integer remainder of that division. According to the Programming Guide, both of these operations are relatively slow. However, I need both of these values in my kernel. Are there any functions that will return both of these values and are more efficient than computing both of these values independently?

Simon_Green · November 6, 2008, 4:38pm

Not that I know of, although this a good idea since modulo and integer are often used together. It may be that the JIT optimizer is smart enough to remove the redundant code from both calls.

If you know that your input values are exactly representable in floating point, it may be worth casting them and using the floating point modulo and divide instead.

bdg146psu · November 6, 2008, 6:20pm

Had to go back to look some of this stuff up…

I was unable to find the range of integers that are exactly representable as floating point numbers, but I assume it’s based off the length of the mantissa. Assuming the IEEE-754 floating point standard, there is a 24-bit mantissa (with the leading ‘1’ dropped off, so really represented as 23-bits). That would give an upper limit of 2^24. My integers are all positive, so I don’t need to be concerned with the negative case. Are all positive integers from 2^24 down to zero exactly representable?

This leaves me with two concerns, however. My first concern is that, according to NVIDIA documentation, floating-point division is not IEEE-754 compliant. However, if these two integers are represented exactly, this may not matter.

My second concern is this. What if the quotient of two integers isn’t able to be represented exactly using floating point. This could probably cause some errors when truncating the value. Hypothetically speaking, if the quotient should be 6.9999999, but is represented as 7 (since 6.9999999 is unrepresentable), then the truncation will yield 7 rather than 6.

The values in my numerator and denominator will be in the ranges [0, 2^24] and [2^13, 2^23], respectively. The denominator will always be a power of 2. The same is untrue for the numerator.

Thanks!

e.ping · November 6, 2008, 9:59pm

To avoid the inaccuracy of floorf for computed results like 6.9999999 and for numbers that are not exactly representable, you can always add some epsilon and cast the result to an integer. This is cheaper than doing an integer division:

int a = 42; int b = 6; int c = a/b; // 16 cycles
is equivalent to
float a = 42.0f; float b = 6.0f; int c = (int)( (a/b)+0.001f ); // fp rcp+mul or div (2-4 cycles I believe) and fp add (1 cycle)

I don’t know if casting is internally the same as floorf(), but it should. Guess I have to check the PTX.

A cheap modulo can be achieved based on the above with less cycles than integer %.

alex_dubinsky · November 7, 2008, 2:50am

If the denominator is always a power of 2, then division and modulo is as easy as right shift and bitwise-and! You can take the log2 of a power-of-two integer using the function __ffs(x). So, a/b == a>>__ffs(b ). a%b == a&(b-1).

You avoid this by specifying “round-to-zero.” EDIT: Ugh, it seems like the only rounding mode for division, unlike for multiplication and addition, is round-to-nearest-even. Hmm.

I can’t find how long fmod() takes. Any confirmation that it’s quick? If so, you’d take the modulus, do subtraction, then do regular float division (or rather, use __fdividef(x,y)). That should give perfect precision, I think.

For getting back an integer, the Guide has this to say:

SPWorley · November 7, 2008, 9:02am

Has anyone ever benchmarked conversion speeds of int->float or float->int? That’d be very interesting to know. On x86, they can be surprisingly costly. There are bit-level tricks to do conversion of small floats to ints sometimes used in x86 with doubles to avoid it. Such tricks could also work on the GPU but it’d depend on the hardware’s native conversion speed…
so if anyone has measured it, it’d be fascinating to know if it’s a cheap 4 clock instruction (like float + - *) or something more expensive.
Lots of geeky details: [url=“Know Your FPU, 2006 Edition”]http://www.stereopsis.com/sree/fpu2006.html[/url]

Also, if you’re doing integer divides or modulo with a CONSTANT number, there’s amazing tricks that can shift those into multiplies and cleanup code. That’s usually done by an optimizing compiler, so looking at the generated PTX code would tell you if nvcc is smart enough or not.
Geeky details in a whole chapter (50 pages!) of Hacker’s Delight. [url=“http://www.hackersdelight.org/”]Hackersdelight.org

pstach · November 7, 2008, 7:27pm

I’ve only leafed through Hacker’s Delight in a bookstore, but the bulk of their division/remainder optimizations are a regurgitation of “Division of Invariant Integers Using Multiplication” by Granlund and Montgomery. http://citeseerx.ist.psu.edu/viewdoc/downl…p1&type=pdf

NVCC and the “JIT optimizer” do not appear to implement this method (from testing), however to address the problem of the original poster if in fact he is dividing by a constant, he can precompute and alter his code. I can post a calculator for said values if so desired.

In CUDA a 64 bit division would be converted to

// q = quotent, a = dividend, c = constant computed from divisor

unsigned long long q, a, c;

q = __umul64hi(a, c);

To get the remainder, simply compute r = a - (q * d) where d is the divisor.

-Patrick

pstach · November 7, 2008, 8:03pm

Sorry, double post

Sylvain_Collange · November 8, 2008, 11:43am

Yes, all variations of _float2int[rn,rz,ru,rd] and __int2float_xx map to a single G80 instruction which costs 4 cycles. Even though float mul and add only support rounding to the nearest or toward zero, conversions to/from int support all 4 IEEE rounding modes.

Even more interesting I think is that __int_as_float and __float_as_int cost 0 cycles…

alex_dubinsky · November 8, 2008, 6:43pm

Let’s not get elitist here. If Granlund&Montogmery wrote a dense, indecipherable (for a programmer) paper, and Hacker’s Delight explained the same thing and made it accessible, then it’s not a “regurgitation.” People should be referred to hacker’s delight and not the “original.” (This is a total rant, but I hate how in science people always have the poser gene in them and have to refer to the “old school” the “first” etc paper or research. And when someone does the invaluable job of recasting the same, then – their work is equated with chewing something, swallowing it, and vomitting it back out? It holds science back!)

alex_dubinsky · November 8, 2008, 6:55pm

Unfortunately with the way IEEE defined a float’s exponent (it’s not two’s complement, it’s a biased positive integer), reinterpreting an int (eg 10) as a float gets you a very tiny number (eg 1e-44). Moreover, you get a denormal, which I don’t think is even supported by cuda (so actually, you just get 0). (You have to bitwise-and the exponent into the int before reinterpreting, which is what someone was saying regarding x86.)

E.D_Riedijk · November 9, 2008, 11:27am

denormals are supported for doubles on the double precision unit on GT200 architecture.

Sylvain_Collange · November 9, 2008, 7:22pm

My point was that on the GPU you can afford to reinterpret floats as ints to perform bit masking and arithmetic operations on it, since there is no penalty associated with typecasting.

This is not the case on most current CPUs. Reinterpreting a float as an int typically adds a few cycles of latency, or more on architectures with separate integer and FP register files (e.g. AMD K8 and K10). So it is often more efficient on those to perform a few more FP operations than work directly on the mantissa and exponent.

Whereas on the GPU you are free to perform any dirty hack on FP numbers :)

Reimar · November 10, 2008, 7:55am

Well, I personally have a problem with both. One treats it as a complicated mathematical problem while the other one treats it as “magic”.

The idea is trivial: It is just multiplication by the inverse, and the inverse is represented as 32 bit fixed-point number, so e.g. instead of a / 12 you write a * ((1 / 12) * 2^35) / 2 ^ 35.

alex_dubinsky · November 10, 2008, 6:36pm

Say, that there makes sense

Topic		Replies	Views
Bug with integer division? CUDA Programming and Performance	33	9351	September 9, 2015
Is it possible to replace integer division by floating-point division for speed CUDA Programming and Performance cuda	9	2329	March 24, 2022
Measurements of different CUDA operator throughputs CUDA Programming and Performance	32	49892	August 24, 2009
error when trying to use half (fp16) CUDA Programming and Performance	16	20022	October 13, 2015
Forward looking GPU integer performance CUDA Programming and Performance	22	21586	March 20, 2017
Division problem (weird behavior) CUDA Programming and Performance	23	18034	November 15, 2010
How slow is integer division and modulo? CUDA Programming and Performance	11	10956	September 23, 2008
Speed comparison of division compared to other arithmetic operations, perhaps something like clock cycles CUDA Programming and Performance	9	5755	November 19, 2024
Cuda 3.5 Integer Multiply Performance Is it really 3x slower than 64-bit floating point? CUDA Programming and Performance	21	19929	March 12, 2014
error in modulo operation CUDA Programming and Performance	12	16099	September 20, 2009

integer division and modulo

Related topics