I’m searching for an algorithm which is suitable for following problem:
I want to calculate the remainder r of a/d.
0 < d < 2^80
0 <= r < d
0 <= a <= 2 * r^2
(so d and r are 80 bits and a is 160bits big)
d is different for each thread (a and r aswell)
I haven’t decided a data format yet.
Maybe 5 (10) times 16 bits for d,r (a)?
Or 3 (6) times 30bits (from 32bits) for d,r (a) leaving 2 bits per int for carry?
If the per-thread denominators are reused for many computations then you could use Montgomery multiplication to interleave forming the products and dividing by a given d. The other alternatives are either
Barrett’s algorithm, essentially multiplying by an integer inverse, and
long division using Knuth’s algorithm. This is pretty complex but can be surprisingly fast.
Barrett’s algorithm is described in the Handbook of Applied Cryptography (available online), and code for Knuth’s algorithm is in the msieve source tree. There’s even CUDA code for 96-bit modular multiplication :)
yes, I also think Montgomery multiplication should be the best solution.
You can represent 80-bit integer by four 24-bit chunks because GPU has native 24-bit multiplication and you have enough space for carry propagation.
To multiply 80-bit integers (required for Montgomery reduction) you can use some splitting scheme, i.e. Karatsuba or Toom-Cook.
as a basic “building block” you will need to multiply 24-bit numbers, this can be done as follows:
computes 48-bit result: a * b + c (where a and b are 24-bit numbers):
lo = __umul24(a, B) + c; // lower 32-bits
hi = __umul24hi(a, B); // upper 32-bits (overlap)
hi = (hi >> 16) + (lo < c); // account for carry
hi = __umul24(hi, 0x1000000 + (1 << 8)) + (lo >> 24); // remap to 24-24 bit range (single mad24.lo instruction)
lo &= 0xffffff;
Hmm… how much work was it to integrate new intrinsics? Do you have a pure “diff” of the changes you made available for download? I feel my fingers tickling, wanting to change a few things myself ;)
Also how difficult would it be to merge this with the toolkit V2.3 NVCC version?
I added ‘diff’ sources to the web-page, now the compiler is also compatible with CUDA 2.3.
and I hope if you apply some modifications yourself, you will make them publicly available as well ;)
for instance it would be nice to add ‘red’ instruction - to evaluate its performance, it is not a native instruction
but decuda fails to decode it, so there should be smth interesting happening inside…