I'm working on a project that will require CUDA-accelerated multi-precision modular arithmetic (integer addition and multiplication).
I have already read about methods such as Montgomery’s method and its variations, like Coarsely Integrated Operand Scanning (CIOS), which interleaves multiplication and reduction, but I didn’t find any examples or source code.
It would be really helpful for me to see such methods implemented, so that I can study actual source code.
I was unable to find anything. Do you have any tips? I would also greatly appreciate it if anyone could post even a small chunk of code implementing these methods.