How slow is integer division and modulo?

The programming guide advises that “integer division and modulo operations are particularly costly and should be avoided if possible.” Just how costly is “particularly costly”? Are we talking about hundreds of cycles? Thousands of cycles? More?

My kernel needs to make regular use of integer division and modulo, but only for small positive numbers, generally array indexes. It’s running much slower than I had hoped, and having basically ruled out non-coalesced memory access, I’m wondering if integer arithmetic might be the culprit. Would it be safe to replace my integers with floats, given that I should be able to restrict my computation to the mantissa? And how expensive is it to cast a float back to an int before using it as an array index?

Thanks,
Jason

I haven’t measured division performance, but it is _really_ slow. Probably not thousands of cycles, but a few hundred.
I’m using multiplication instead of division where possible.
If you can describe what your kernel needs to do, I might be able to help you a bit.

As an example, I use division to map from a 1-D vector to a 2-D vector. I need to both left-multiply and right-multiply a matrix with two different vectors. The matrix is stored row-first in global memory, and I need to figure out which row and column each thread index (which reads in a single element of the matrix) corresponds to. The matrices processed by different blocks are not of the same size, so I can’t perform the division implicitly by allocating 2-D blocks.

How can you ever use multiplication in place of division? I don’t think I’ve seen such a technique described in the CUDA literature. I’d rather understand the general concept than have any particular problem solved for me. Give a man a fish and he’ll eat for but a day and whatnot.

Thanks,
Jason

Hacker’s Delight, Chapter 9. A great book for this kind of low-level coding. Fun to read for math geeks.

Compilers will likely do this for you for division by constants, like x/123. Check the PTX code to see what the compiler does.

It’s the arbitrary case of x/y which is very slow.

freak… it is GREAT!!! I started reading it, and before I knew it 30 minutes were gone…

One thing: what is a 0-test? And by | does he mean OR? XOR?

AFAIK:

| is an OR

0-test is f (value == 0) or f (!value). Supposed to be slightly faster sometimes? Probably makes a difference if you’re programming a raytracer running on a washing machine :)

I apologize for my ignorance, but what do you mean by f? Is it a function or something?

Yes, “Hacker’s Delight”. You can adjust the method to better suit your needs. For example, in my kernels I usually need to calculate X mod Y, where X is in the range, say, 1-64M and Y is 2-255. For such a scenario it is possible to pre-compute 254 ‘magic’ numbers (one for each Y) and perform the division with one multiplication and without any shifts (using __umulhi()).

That was supposed to be an IF :D Don’t know how that happened …

if (!someValue)

On a Tesla C870:

modulo: 140 cycles

integer division: 124 cycles

int a,b;
a = __fdividef(a,b): 16 cycles

mvg: where did you find those timing values? Or did you write a test harness to measure them yourself?

It’d be AWESOME to have a list of all the math op speeds! Though admittedly many of the timings are hard to summarize as a single number. A double multiply may be pretty fast for a single evaluation, but the number of threads with work may change that effective speed (since there’s only one double ALU), etc. Maybe it’s the same for transcendentals. Maybe there’s also pipelining latency for sqrt(), etc…

I did indeed write a test harness; I will have to check whether I am allowed to publish its source code. I will post some more results tomorrow.