yes, I also think Montgomery multiplication should be the best solution.

You can represent 80-bit integer by four 24-bit chunks because GPU has native 24-bit multiplication and you have enough space for carry propagation.

To multiply 80-bit integers (required for Montgomery reduction) you can use some splitting scheme, i.e. Karatsuba or Toom-Cook.

as a basic “building block” you will need to multiply 24-bit numbers, this can be done as follows:

computes 48-bit result: a * b + c (where a and b are 24-bit numbers):

lo = __umul24(a, B) + c; // lower 32-bits

hi = __umul24hi(a, B); // upper 32-bits (overlap)

hi = (hi >> 16) + (lo < c); // account for carry

hi = __umul24(hi, 0x1000000 + (1 << 8)) + (lo >> 24); // remap to 24-24 bit range (single mad24.lo instruction)

lo &= 0xffffff;

now lo and hi are 24-bit numbers representing the product

__umul24hi intrinsic is originally not available in CUDA, you might want to try out my compiler:

http://www.mpi-inf.mpg.de/~emeliyan/cuda-compiler/