Parallelizing multiplication

I’m currently working on a school project where I’m implementing RSA using CUDA. My task is to make a performance comparison between normal CPU processing vs. the parallel processing of the GPU.

The main problem I need to look into is multiplication of big numbers which is what makes RSA encryption so slow compared to symmetric encryption. Does anyone have any experience, ideas or thoughts around parallelizing multiplication of big numbers in CUDA?

You could look at the reduction kernel in the SDK, but the problem with multiplication (vs. summation) is that you could (depending on the numbers involved) run into the maximum value of even the 64-bit integer very quickly.

However, if you know you’re not going to overflow the 64-bit int, then just use the reduction example and you’ll get a good speedup.

I think the problem is multiplicating two large (e.g. 2048 bit) numbers, not multiplying lots of numbers onto each other.

My plan was to use arrays of 1’s and 0’s and perform operations on those instead of using integers. The numbers in intermediate calculations in for example RSA encryption will be even larger then 2048 bits if I have understood it correctly.

I will look into the reduction example though, maybe there’s some inspiration to be gained.

Well, I don’t know a ton about the algorithms for multiplying arbitrarily large integers, but perhaps you could take your large (say, 2048-bit) numbers, and represent them as an array/struct of 64-bit numbers in shared memory, then have each thread do some kind of binary operations (and/or/xor/shift etc.) on pairs of individual numbers until you get the actual product of the original large numbers (sort of like implementing a huge multiplier logic circuit in CUDA code).

Maybe I misunderstood you, but these ideas sound completely crazy to me. You absolutely will want to reuse the GPU’s multiplication function, I am really sure any RSA code you can find will do that on the CPU too.
Multiplying large numbers is actually quite basic math, you probably do it all the time, e.g. when you calculate
3416 = (310 + 4) * (110 + 6) = 310*(110 + 6) + 4 * (110 + 6) = (31)(1010) + (64 + 4 1) * 10 + 46
Just that on the CPU you use e.g. 2^32 instead of 10.
You’ll probably still need some time to find a nice way to parallelize that, particularly the small shared memory might make this a bit annoying.

Perhaps you’re right, but here’s the way I figured it…a number larger than what the GPU natively supports (say, a 256-bit integer) could be broken down into 4 64-bit integers, or 8 32-bit integers, and some operation performed on each ‘piece’ of the two numbers (i.e. piece 1 of int a and piece 1 of int B) in parallel by the CUDA threads. When the first step is done, the results will need to be adjusted for any overflows from the ‘piece’ operations, so __syncthreads() is called and the threads adjust their values according to whatever overflow results have been stored in shared memory.

I don’t know if this way would be faster than doing a multiplication as you suggested, but my way is essentially how the multiplication operation is implemented in hardware (i.e. how the CPU does it). I might try to write some test code this weekend to see if I can make my way work, since if it does work, CUDA parallelism should allow it to multiply very large numbers (i.e. whatever the memory can store) fairly quickly.


Unfortunately, to my knowledge, there’s no way to cleverly implement any kind of encryption with SIMD paradigm, at least using modern versions.

Bit-wise moltiplication is rather slow, especially as your first implementation: you would multiply just one bit per cycle, than you would have to update the data before multiplying again.

SIMD is good just for craking encryption…

first, you do not need syncthreads if you restrict yourself to 32 threads accessing the same memory area, which you should do as much as possible.

Also multiplication is not acting “local” so you must apply “some operation” not only to “piece 1 of int a and piece 1 of int b” and then the same for the pieces 2, 3, etc. but also all combinations, e.g. “piece 1 of int a and piece 3 of int b”.

So the number of operations increases with the square of the number of pieces you have.

Anyway, here is a piece of (untested) code you can think about when your realise your idea probably will not work quite as well as you hope.

uint32_t __shared__ a[32];

uint32_t __shared__ b[32];

uint32_t __shared__ res[64];

uint32_t __shared__ carry[64];

a[threadIdx.x] = a_global[threadIdx.x];

b[threadIdx.x] = b_global[threadIdx.x];

res[threadIdx.x] = 0;

carry[threadIdx.x] = 0;

res[32+threadIdx.x] = 0;

carry[32+threadIdx.x] = 0;

for (i = 0; i < 32; i++) {

  uint64_t tmp = ((uint64_t)a[i]) * b[threadIdx.x];

  uint32_t low = tmp;

  uint32_t high = tmp >> 32;

  res[i+threadIdx.x] += low;

  if (res[i+threadIdx.x] < low) carry[i+threadIdx.x+1]++;

  res[i+threadIdx.x+1] += high;

  if (res[i+threadIdx.x+1] < high) carry[i+threadIdx.x+2]++;


if (threadIdx.x == 0)

  for (i = 1; i < 63; i++) {

    res[i] += carry[i];

    if (res[i] < carry[i]) carry[i+1]++;


Yes, CUDA can be used, but a truly FAST implementation will be a lot of work to get working properly. The crucial trick is to change integer multiplication (which is O(n^2) serial algorithm where n is the number of bits) into an operation using a parallel FFT and FLOATING POINT numbers. This isn’t an obvious transform but it’s the right way to get performance.

The idea of using FFTs for integer multiplication comes from the realization that multiplication is a kind of convolution of two series, and convolution can be done in the frequency domain much more efficiently.

There is excellent optimized CPU code for doing this, used by the Mersenne project: see
There’s lots of subtleties, especially about floating point error large enough to affect the final computed integer value when de-FFTed.

CUFFT gives you an FFT on the GPU already.

One question is whether or not a 2048 bit RSA is a large enough multiplication to make it worthwhile to go for the FFT route or not. It’s unclear where the breakeven would be… but that question makes it a very interesting project.

One thing that would be neat if this worked out, using orjanb’s method or SPWorley’s method, would be to use CUDA to find prime factorizations of a number (or simply test a number for primality). That in itself could have interesting encryption implications…

you introduce error that, although small, make the encryption useless.

Most of the implementation complications come from exactly quantifying the floating point errors and fixing them up when they happen. The output results therefore have bounded floating point error, which when quantized back to integers, gives you guaranteed integer results.

This is easier to do with double precision, but even float precision works quite well and can be more efficient (more single precision error requires smaller batches but floats are usually faster to compute with SSE or CUDA.)

Please refer to this paper for more details.…0Transforms.pdf

For exact huge-integer multiplication and powers, floating point FFT methods are the method of choice, and that makes them admirably suited to CUDA implementation. The biggest question is whether a “mere” 2048 bit number is large enough to make the FFT strategy worthwhile.

Wow, this guy wanted to use one-int-per-bit and you throw “well, multiplication can actually be an operation applied to data transformed using trigonometric functions into the frequency domain.” (p.s. holy crap. how does that work?)

I think a) let’s not do too much of his homework. b ) the analogy is exactly like multiplying numbers longhand out on paper. (They still teach that, right? lol). You can do this using binary, and represent one-bit-per-int. Might as well start out trying that, in fact. Or you could do it using bigger digits, bigger than even decimal, as big as will fit into an int or int64. Try it and you’ll figure it out. Since in the end it’s all just a bunch of adding (with carry), and you know which operation will end up where, i think it’d parallelize ok. Of course, it’d be even faster to do it serially in one kernel and just do a thousand multiplications at once.

I works because “convolution in the time domain is multiplication in the frequency domain” as the “saying” goes.

The stuff you do when you multiply a large number by multiplying the smaller pieces (what you called “multiplying numbers longhand out on paper”) and adding them up is actually the same thing as doing a convolution (a bit hard to see unless you write out the formulas though).

And the FFT is what does the conversion between time and frequency domain and back.

That said, as nice as it is in theory, this is for really huge numbers (like 200000 bits), RSA usually only involves just large numbers (I think never more than ca. 8000 bits) because FFT is complicated with lots of overhead and more complicated memory access patters - which are particularly bad on GPUs (and for humans, which is why I have never seen anybody using it on paper :) ).