I have been unable to find the number of clock cycles for performing integer division of 32 and 64 bit quantities. Extensive Googling turns up only the mantra; “Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible:”. I’ve tried to make measurements from some sample code running on a Tesla C1060 but with inconclusive results.

Does anyone have the real information?

I ask because I’m about to embark in writing some code for multiple precision integer arithmetic and want to know how to get decent performance without having to write several versions of complicated algorithms to see which is likely to get the best performance. For instance:

q = x / y; r = x%y;

can be written as

q = x / y; r = x - y * q;

where a remainder is traded for a multiplication and a subtraction. Testing which is faster is simple enough in single precision (by which I mean unsigned long or unsigned long long) arithmetic but gets rather more difficult in multiple precision, especially as there appears to be no access to quotient and remainder of double length quantities analogous to __umulhi() and __umul64hi() for double length multiplication.

Thanks in advance,

Paul