What properties does a problem suitable for CUDA have?

orjanb314 · November 10, 2008, 12:45pm

This semester I have been working on a school project about public key cryptography. The main goal of my project is to implement RSA in CUDA with the aim of achieving a performance increase vs CPU implementations.

The project is soon coming to an end and I have begun to think that the modular arithmetic in public key cryptography isn’t something that is easily parallelizable. The main function of RSA encryption/decryption is this:

a^b Mod n

where a and n are very large integers. Mathematica spends 92 microseconds to calculate this function and I don’t see how I can compete with that in CUDA, with my novice programming skills.

The most obvious thing to do is to just run many of these calculations in parallel, which would be the equivalent to encrypting several message blocks at once. But the same principle can be applied to symmetric block ciphers to achieve a similar performance increase and the performance gain in the public key implementation would be useless.

So my professor wanted me to look into parallelizing the modular function itself, but it seems like it isn’t a suitable problem for CUDA. A good problem for CUDA would be something where a very large number of parallel computations can be performed on the same set of data. In the modular exponentiation function I quickly run into doing a lot of additions which can’t be run in parallel and I end up with not running that many parallel threads.

Any input on this if anyone bothered to read everything? ;)

alex_dubinsky · November 10, 2008, 7:07pm

Well, you seem to understand everything perfectly.

One thing I’d add is that a great CUDA algorithm is not just one that breaks into many threads, but also makes good use of the GPU’s on-die SRAMs (such as registers, shared mem, and constant mem).

wumpus · November 10, 2008, 8:05pm

Indeed, for a problem to be suitable to CUDA you need to be able to divide the task into a large number of pieces that can run in parallel without needing too much data sharing.
I’m not sure whether or not large integer operations fall into this category, there might be some trick to do it. As I remember, adders and multipliers in chips can also be implemented in various more or less parallel ways. There might be a way to write it as some kind of reduction?

Keldor314 · November 12, 2008, 10:20pm

I can see where you might be able to do something at the warp level to accelerate arithmetic of large numbers, at least for multiplication and addition. The division/modulus might be a problem since it has dependencies in its computation. I don’t know exactly how the exponentiation is computed, but if it ends up using multiplication and addition, those again might be accelerated.

alex_dubinsky · November 13, 2008, 3:16am

I don’t understand this part. I think running jobs in parallel is the best strategy. Even if you optimize the mod function, you’ll only really be able to do it at a block level (you’d share data via smem, etc), and so you’ll want another level of parallelism anyway.

Topic		Replies	Views
Parallelizing multiplication CUDA Programming and Performance	14	7580	October 11, 2008
noob questions their intended 'use case' for parallel programming with CUDA Teaching and Curriculum Support	6	1962	June 20, 2015
Thinking parallel CUDA Programming and Performance	5	1956	January 17, 2010
Any idea for a simple graduation job using CUDA? CUDA Programming and Performance	0	662	March 22, 2013
Designing a CUDA algo question Sort of a newbie question.... CUDA Programming and Performance	2	2371	December 9, 2011
Factor "RSA like" 100 digits number CUDA Programming and Performance	8	4665	April 29, 2013
Is it doable with CUDA? CUDA Programming and Performance	6	565	December 29, 2019
Some questions about CUDA and my problem CUDA Programming and Performance	3	2357	June 18, 2008
implement parallelism among procedures? CUDA Programming and Performance	1	899	June 22, 2009
CUDA and fixed-point comparaison on big array Is CUDA suitable for fixed-point comparaison? CUDA Programming and Performance	7	2519	May 9, 2011

What properties does a problem suitable for CUDA have?

Related topics