I am trying to implement 1024-bit RSA encryption on CUDA. In order to do this, I need to perform wide operations such as 1024-bit multiplication. Can anyone help me? How can this be made parallel?
Also, is there an efficient way to get the modulus for a wide division like this (in parallel)? I mean something like "2048 bits" % "1024 bits".
One thing one could do to parallelize the multiplication is to accumulate columns of partial products in parallel, then propagate the carries between columns at the end. If one were to do this at the HLL level rather than in PTX, multiplying 32x32->64 bits using integer multiply plus __umulhi() and accumulating into a 96-bit per-column accumulator seems a reasonable first cut. A single multiplication is unlikely to fill the GPU unless the numbers are much larger than 1024 bits, at which point one might want to go with FFT-based multiplication. There are many different ways of multiplying (e.g. Karatsuba, Toom-Cook, FFT) and exponentiating (e.g. binary, sliding window), so some experimentation seems called for to find the switchover points between the various methods. It's also a good idea to peruse the existing literature first, as there may already be published papers about RSA on the GPU.
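As a hedged illustration of the column approach (all names and sizes below are my own, not from the post): one CUDA thread per product column, each accumulating its 32x32->64-bit partial products into a 96-bit accumulator held as three 32-bit words, followed by a carry-release pass, shown sequentially here for clarity.

```
// Sketch only: column-wise accumulation for an N-word (N=32 -> 1024-bit)
// schoolbook multiply. One thread handles one of the 2N-1 columns.
#define N 32

__global__ void mul_columns(const unsigned int *a, const unsigned int *b,
                            unsigned int *col_lo, unsigned int *col_mid,
                            unsigned int *col_hi)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (k >= 2 * N - 1) return;

    unsigned int lo = 0, mid = 0, hi = 0;            // 96-bit accumulator
    for (int i = max(0, k - (N - 1)); i <= min(k, N - 1); i++) {
        unsigned int p_lo = a[i] * b[k - i];             // low  32 bits
        unsigned int p_hi = __umulhi(a[i], b[k - i]);    // high 32 bits
        // fold the 64-bit partial product into lo:mid:hi
        unsigned long long s = (unsigned long long)lo + p_lo;
        lo  = (unsigned int)s;
        s   = (unsigned long long)mid + p_hi + (s >> 32);
        mid = (unsigned int)s;
        hi += (unsigned int)(s >> 32);
    }
    col_lo[k] = lo;  col_mid[k] = mid;  col_hi[k] = hi;
}

// Carry release: product word k collects lo[k], mid[k-1], and hi[k-2],
// plus the carry out of word k-1.
void propagate_carries(const unsigned int *lo, const unsigned int *mid,
                       const unsigned int *hi, unsigned int *prod)
{
    unsigned long long carry = 0;
    for (int k = 0; k < 2 * N; k++) {
        unsigned long long s = carry;
        if (k < 2 * N - 1) s += lo[k];
        if (k >= 1)        s += mid[k - 1];
        if (k >= 2)        s += hi[k - 2];
        prod[k] = (unsigned int)s;
        carry = s >> 32;
    }
}
```

With 32 words per operand, each column sums at most 32 partial products of 64 bits each, which fits comfortably in 96 bits; a production version would also parallelize the carry release rather than run it sequentially.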
Note that one typically does not perform a full modulo operation for the reduction step in RSA (cf. Montgomery and Barrett reduction). If you do need a full modulo operation, it can be mapped back to multiplication via Newton-Raphson based division and a back-multiply.
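To make the back-multiply idea concrete, here is a toy-scale sketch (the helper is hypothetical and my own; at multi-word scale the reciprocal would instead be refined by Newton-Raphson iterations x' = x * (2 - d * x), doubling the number of correct bits per step):

```
// Toy-scale sketch of "divide via reciprocal estimate, then back-multiply":
// computes n % d. At 1024-bit scale the reciprocal comes from Newton-Raphson
// iteration rather than a hardware divide, but the structure is the same.
unsigned int mod_via_backmultiply(unsigned long long n, unsigned int d)
{
    double recip = 1.0 / (double)d;               // reciprocal estimate
    unsigned long long q =
        (unsigned long long)((double)n * recip);  // quotient estimate
    long long r = (long long)(n - q * (unsigned long long)d); // back-multiply
    // The estimate can be off by a small amount; fix up the remainder.
    // A production version bounds the estimate error analytically.
    while (r < 0)             r += d;
    while (r >= (long long)d) r -= d;
    return (unsigned int)r;
}
```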
[following added later]
Here is a fairly recent paper that looks like it could be a reasonable starting point (I have not read this, I am just going by the abstract):
Owen Harrison, John Waldron. Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware.
In: Proceedings of AfricaCrypt 2009, June 21-25, 2009, pp. 350-367
Abstract: […] We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput.[…]
I have an algorithm to compute the multiplication of two N-digit numbers with a result of order 2N-1. A sample result of my algorithm is shown below. Contact me if you need more details about it.
If you would like to multiply large numbers: in the 1980s, Luther Welch and I searched for Mersenne primes using the Lucas-Lehmer test. This involves squaring large integers, in our case about 33,000 digits (now considered trivial). GIMPS can square a 33,000-digit number 110,502 times, taking the remainder mod 2^110503, in just 11 seconds. You need to learn about FFT multiplication, where you can multiply each complex pair completely in parallel. I don't know if little 1024-bit numbers (about 333 digits) are large enough to offset the cost of converting to and from the complex FFT domain. I think it is worth a try, however.
I have a basic question on implementing RSA using the FFT: for RSA-1024, should I do the 1024 multiplications in the frequency domain and take a single inverse Fourier transform at the end, or should I do an inverse Fourier transform immediately after the FFT for each multiplication? I know the first way may be wrong, but at the same time the overhead of doing an FFT and then an inverse FFT for each multiplication is big, and I want to reduce that overhead to a minimum so that the intermediate multiplications are done in floating point only. Please comment if I am wrong. Thanks!
It must be the latter. Multiplying two numbers is a convolution operation, and you have to transform back into the integer domain to preserve full accuracy in the product. There are alternative techniques like RNS (residue number systems) that avoid having to release carries across an entire multi-word integer, but you can’t use the FFT on them.
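To make that concrete, here is a hedged sketch of the step the FFT route cannot skip (names and the choice of base are mine): after the inverse transform, the real outputs are rounded to integer column sums and the carries are released to recover the exact digits of the product.

```
// Sketch: carry release after an inverse FFT. 'conv' holds the (real parts
// of the) inverse-transform outputs, i.e. the convolution column sums of
// two base-2^16 digit arrays; rounding plus carry propagation yields the
// exact digits of the product.
void release_carries(const double *conv, unsigned int *digits, int n)
{
    unsigned long long carry = 0;
    for (int i = 0; i < n; i++) {
        unsigned long long t =
            (unsigned long long)(conv[i] + 0.5) + carry;  // round to integer
        digits[i] = (unsigned int)(t & 0xFFFF);           // low 16 bits stay
        carry     = t >> 16;                              // rest moves up
    }
}
```

The small base (2^16 rather than 2^32) keeps the column sums well within the 53-bit double-precision mantissa, which is what preserves full accuracy in the rounding step.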
Wouldn’t it also be worthwhile to point to one or several of your relevant papers? I assume you are the same “Niall Emmart” who published on these topics a couple of years back …