This semester I have been working on a school project about public key cryptography. The main goal of my project is to implement RSA in CUDA with the aim of achieving a performance increase vs CPU implementations.

The project is nearing its end, and I have begun to think that the modular arithmetic in public-key cryptography isn’t something that is easily parallelizable. The core operation of RSA encryption/decryption is:

a^b mod n

where a and n are very large integers. Mathematica takes 92 microseconds to compute this, and I don’t see how I can compete with that in CUDA given my novice programming skills.
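For context, the standard way to compute a^b mod n is square-and-multiply. A minimal Python sketch (the function name and structure are my own, just to illustrate where the serial dependency comes from):

```python
def modexp(a, b, n):
    # Left-to-right square-and-multiply: one squaring per exponent bit,
    # plus an extra multiply when the bit is set. Every iteration uses
    # the result of the previous one, so the ~bit_length(b) steps form
    # a strictly serial chain -- this is the part that resists
    # parallelization across threads.
    result = 1
    for bit in bin(b)[2:]:          # scan exponent bits, MSB first
        result = (result * result) % n
        if bit == '1':
            result = (result * a) % n
    return result

print(modexp(7, 560, 561))  # → 1 (561 is a Carmichael number)
```

The loop itself can’t be split across threads; any parallelism has to come from inside the big-integer multiply/mod operations each iteration performs.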

The most obvious approach is to just run many of these calculations in parallel, which would be the equivalent of encrypting several message blocks at once. But the same trick can be applied to symmetric block ciphers for a similar speedup, so the performance gain in the public-key implementation wouldn’t demonstrate anything special.

So my professor wanted me to look into parallelizing the modular exponentiation itself, but it doesn’t seem like a good fit for CUDA. A good CUDA problem is one where a very large number of parallel computations can be performed on the same set of data. In modular exponentiation I quickly run into long chains of additions that can’t be run in parallel, and I end up with far too few concurrent threads.

Any input on this if anyone bothered to read everything? ;)

Well, you seem to understand everything perfectly.

One thing I’d add is that a great CUDA algorithm is not just one that breaks into many threads, but also one that makes good use of the GPU’s on-die SRAMs (registers, shared memory, and constant memory).

Indeed, for a problem to be suitable for CUDA you need to be able to divide the task into a large number of pieces that can run in parallel without needing too much data sharing.
I’m not sure whether large-integer operations fall into this category, but there might be some trick to do it. As I recall, adders and multipliers in hardware can also be implemented in various more or less parallel ways. There might be a way to write it as some kind of reduction?
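There is indeed such a trick for addition: carry propagation can be turned into a parallel prefix scan over (generate, propagate) flags, exactly like a Kogge-Stone hardware adder. A hedged Python sketch simulating this over 32-bit limbs (all names here are mine; each list comprehension corresponds to a step that would be fully parallel across threads on a GPU):

```python
BASE = 1 << 32  # 32-bit limbs, least significant limb first

def parallel_add(a, b):
    # Add two equal-length limb arrays. The limb sums and the flag
    # computations are embarrassingly parallel; the carry chain is
    # resolved in log2(n) fully parallel scan steps instead of a
    # sequential ripple.
    n = len(a)
    s = [x + y for x, y in zip(a, b)]
    g = [int(t >= BASE) for t in s]                     # limb generates a carry
    p = [int(t & (BASE - 1) == BASE - 1) for t in s]    # limb propagates a carry
    d = 1
    while d < n:  # Kogge-Stone prefix scan: log2(n) steps
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    carry_in = [0] + g[:-1]                             # carry into limb i
    limbs = [(s[i] + carry_in[i]) & (BASE - 1) for i in range(n)]
    return limbs, g[-1]                                 # result limbs + carry out
```

So for a 1024-bit number split into 32 limbs, the carry chain collapses from 32 sequential steps to 5 parallel scan steps, which maps nicely onto a warp.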

I can see where you might be able to do something at the warp level to accelerate large-number arithmetic, at least for multiplication and addition. The division/modulus might be a problem, since it has data dependencies in its computation. I don’t know exactly how the exponentiation is computed, but if it boils down to multiplications and additions, those again might be accelerated.
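To make the multiplication point concrete: in schoolbook multiplication every partial product a[i]*b[j] is independent, so each could go to its own thread, and the per-column sums are reductions. A rough Python sketch under those assumptions (function and names are hypothetical, not from any library):

```python
BASE = 1 << 16  # small limbs so partial products fit comfortably in a word

def limb_mul(a, b):
    # Schoolbook multiply, least significant limb first. The n*m partial
    # products below are all independent (one per thread on a GPU) and
    # each column sum is a parallel reduction; only the final carry
    # ripple is inherently sequential in this sketch.
    n, m = len(a), len(b)
    cols = [0] * (n + m)
    for i in range(n):              # parallel part: independent products
        for j in range(m):
            cols[i + j] += a[i] * b[j]
    out, carry = [], 0
    for c in cols:                  # sequential part: carry propagation
        c += carry
        out.append(c % BASE)
        carry = c // BASE
    return out
```

The sequential carry ripple at the end could itself be replaced by the prefix-scan idea, so multiplication of two 1024-bit numbers offers on the order of a thousand independent partial products per step, which is warp-friendly.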

I don’t understand this part. I think running jobs in parallel is the best strategy. Even if you optimize the mod function, you’ll only really be able to do it at the block level (sharing data via smem, etc.), so you’ll want another level of parallelism anyway.