How slow is integer division and modulo?

The programming guide advises that “integer division and modulo operations are particularly costly and should be avoided if possible.” Just how costly is “particularly costly”? Are we talking about hundreds of cycles? Thousands of cycles? More?

My kernel needs to make regular use of integer division and modulo, but only for small positive numbers, generally array indexes. It’s running much slower than I had hoped, and having basically ruled out non-coalesced memory access, I’m wondering if integer arithmetic might be the culprit. Would it be safe to replace my integers with floats, given that I should be able to restrict my computation to the mantissa? And how expensive is it to cast a float back to an int before using it as an array index?

Thanks,
Jason

I haven’t measured division performance, but it is _really_ slow. Probably not thousands of cycles, but a few hundred.
I’m using multiplication instead of division where possible.
If you can describe what your kernel needs to do, I might be able to help you a bit.

As an example, I use division to map from a 1-D vector to a 2-D vector. I need to both left-multiply and right-multiply a matrix with two different vectors. The matrix is stored row-first in global memory, and I need to figure out which row and column each thread index (which reads in a single element of the matrix) corresponds to. The matrices processed by different blocks are not of the same size, so I can’t perform the division implicitly by allocating 2-D blocks.

How can you ever use multiplication in place of division? I don’t think I’ve seen such a technique described in the CUDA literature. I’d rather understand the general concept than have any particular problem solved for me. Give a man a fish and he’ll eat for but a day and whatnot.

Thanks,
Jason

Hacker’s Delight, Chapter 9. A great book for this kind of low-level coding. Fun to read for math geeks.

Compilers will likely do this for you for division by constants, like x/123. Check the PTX code to see what the compiler does.

It’s the arbitrary case of x/y which is very slow.

freak… it is GREAT!!! I started reading it, and before I knew it 30 minutes were gone…

One thing: what is a 0-test? And by | does he mean OR? XOR?

AFAIK:

| is an OR

0-test is f (value == 0) or f (!value). Supposed to be slightly faster sometimes? Probably makes a difference if you’re programming a raytracer running on a washing machine :)

I apologize for my ignorance, but what do you mean by f? Is it a function or something?

Yes, “Hacker’s Delight”. You can adjust the method to better suit your needs. For example, in my kernels I usually need to calculate X mod Y, where X is in the range, say, 1-64M and Y is 2-255. For such a scenario it is possible to pre-compute 254 ‘magic’ numbers (one for each Y) and perform the division with one multiplication and without any shifts (using __umulhi()).

That was supposed to be an IF :D Don’t know how that happened …

if (!someValue)

On a Tesla C870:

modulo: 140 cycles

integer division: 124 cycles

int a,b;
a = __fdividef(a,b): 16 cycles

mvg: where did you find those timing values? Or did you write a test harness to measure them yourself?

It’d be AWESOME to have a list of all the math op speeds! Though admittedly many of the timings are hard to summarize as a single number. A double multiply may be pretty fast for a single evaluation, but the number of threads with work may change that effective speed (since there’s only one double ALU), etc. Maybe it’s the same for transcendentals. Maybe there’s also pipelining latency for sqrt(), etc…

I did indeed write a test harness; I will have to check whether I am allowed to publish its source code. I will post some more results tomorrow.