multiple precision integers - how to calculate the remainder


I’m searching for an algorithm suitable for the following problem:
I want to calculate the remainder r of a/d.
0 < d < 2^80
0 <= r < d
0 <= a <= 2 * r^2
(so d and r are 80 bits and a is 160 bits big)
d is different for each thread (a and r as well)

I haven’t decided on a data format yet.
Maybe 5 (10) times 16 bits for d, r (a)?
Or 3 (6) times 30 bits (out of 32) for d, r (a), leaving 2 bits per int for carry?


You basically need a 256-bit integer library :) Look at the following links:

If the per-thread denominators are reused for many computations then you could use Montgomery multiplication to interleave forming the products and dividing by a given d. The other alternatives are either

  • Barrett’s algorithm, essentially multiplying by an integer inverse, and
  • long division using Knuth’s algorithm. This is pretty complex but can be surprisingly fast.

Barrett’s algorithm is described in the Handbook of Applied Cryptography (available online), and code for Knuth’s algorithm is in the msieve source tree. There’s even CUDA code for 96-bit modular multiplication :)

yes, I also think Montgomery multiplication should be the best solution.
You can represent 80-bit integer by four 24-bit chunks because GPU has native 24-bit multiplication and you have enough space for carry propagation.
To multiply 80-bit integers (required for Montgomery reduction) you can use some splitting scheme, i.e. Karatsuba or Toom-Cook.

as a basic “building block” you will need to multiply 24-bit numbers; this can be done as follows:

this computes the 48-bit result a * b + c (where a and b are 24-bit numbers):

lo = __umul24(a, b) + c; // lower 32 bits
hi = __umul24hi(a, b); // upper 32 bits (overlapping)
hi = (hi >> 16) + (lo < c); // account for the carry
hi = __umul24(hi, 0x1000000 + (1 << 8)) + (lo >> 24); // remap to 24-24 bit range (single mad24.lo instruction)
lo &= 0xffffff;

now lo and hi are 24-bit numbers representing the product.
The __umul24hi intrinsic is not available in stock CUDA; you might want to try out my compiler:

Hmm… how much work was it to integrate new intrinsics? Do you have a plain “diff” of your changes available for download? I can feel my fingers itching to change a few things myself ;)

Also, how difficult would it be to merge this with the toolkit V2.3 NVCC version?


well, it is not very hard to add new intrinsics - at least the ones that do not use memory operands ))

I can try to produce a diff against the open64 sources, although they are pretty messy - I remember playing around with the path variables for the different compilation phases and never completely understood how it all works, so I decided to add the path to ‘nvopencc’ to the PATH variable (otherwise it breaks down)

concerning the V2.3 toolkit, I think the compiler should work with it as well (except for the ‘membar’ intrinsics, which are new), because nvopencc is only responsible for translating the NVISA code; the rest is done by the usual compiler

I added the ‘diff’ sources to the web page; the compiler is now also compatible with CUDA 2.3.

and I hope that if you apply some modifications yourself, you will make them publicly available as well ;)

for instance, it would be nice to add the ‘red’ instruction and evaluate its performance - it is not a native instruction,
but decuda fails to decode it, so there should be something interesting happening inside…

Doesn’t it expand into the exact same instructions as ‘atom’?

If not, this is indeed interesting. :)

nope )) that is the point: ‘atom’ instructions provide exclusive memory access, while
‘red’ is a warp-sized reduction, and it looks like the GPU has some “hidden” opcodes to deal with it…

Hi everybody,

thank you for your hints. :)
asm: Does your __umul24hi still have the benefit that a 24x24 multiplication is faster than a 32x32?


of course )) mul24.hi is a native instruction. In fact, a 32-bit mul gets demoted into a sequence of
mul24’s on the GPU; you can observe this using decuda