yes, I also think Montgomery multiplication should be the best solution.
You can represent 80-bit integer by four 24-bit chunks because GPU has native 24-bit multiplication and you have enough space for carry propagation.
To multiply 80-bit integers (required for Montgomery reduction) you can use some splitting scheme, i.e. Karatsuba or Toom-Cook.
as a basic “building block” you will need to multiply 24-bit numbers, this can be done as follows:
computes 48-bit result: a * b + c (where a and b are 24-bit numbers):
lo = __umul24(a, B) + c; // lower 32-bits
hi = __umul24hi(a, B); // upper 32-bits (overlap)
hi = (hi >> 16) + (lo < c); // account for carry
hi = __umul24(hi, 0x1000000 + (1 << 8)) + (lo >> 24); // remap to 24-24 bit range (single mad24.lo instruction)
lo &= 0xffffff;
now lo and hi are 24-bit numbers representing the product
__umul24hi intrinsic is originally not available in CUDA, you might want to try out my compiler: