A small summary of a new application category for CUDA… number theory!

One of the more practical niche problems in number theory has to do with identification of prime numbers. Given N, how can you efficiently determine if it is prime or not? This is not just a theoretical problem. It can be a real one needed in code, perhaps when you need to dynamically find a prime hash table size within certain ranges. If N is something on the order of 2^30, do you really want to do around 30000 trial divisions (one for each candidate up to sqrt(N), about 2^15) to search for any factors? Obviously not.

The common practical solution to this problem is a simple test called an Euler probable prime test, and a more powerful generalization called a Strong Probable Prime (SPRP) test. For an integer N, the test probabilistically classifies it as prime or not, and repeating the test with different parameters increases the probability that the answer is correct. The slow part of the test itself mostly involves computing a value similar to A^(N-1) modulo N. Anyone implementing RSA public-key encryption variants has used this algorithm. It’s useful both for huge integers (like 512 bits) and for normal 32- or 64-bit ints.
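To make the "A^(N-1) modulo N" step concrete, here is a minimal CPU-side sketch in C++ of the simplest check of this kind (usually called the Fermat check), for 32-bit N. The function names are made up for illustration, and this is not the code used in the search. It assumes N is tested against a base A that is not a multiple of N.

```cpp
#include <cassert>
#include <cstdint>

// Computes (base^exp) mod m by square-and-multiply.
// Each product of two 32-bit residues is formed in a 64-bit intermediate.
uint32_t powmod_u32(uint32_t base, uint32_t exp, uint32_t m) {
    assert(m != 0);
    uint64_t result = 1 % m;
    uint64_t b = base % m;
    while (exp > 0) {
        if (exp & 1u) result = (result * b) % m;
        b = (b * b) % m;
        exp >>= 1;
    }
    return static_cast<uint32_t>(result);
}

// Hypothetical helper: true when n passes the base-a check a^(n-1) mod n == 1.
// "true" only means "probably prime"; "false" proves an odd n is composite.
// Assumes a is not a multiple of n.
bool passes_fermat_check(uint32_t n, uint32_t a) {
    if (n < 3) return n == 2;
    if ((n & 1u) == 0) return false;
    return powmod_u32(a, n - 1, n) == 1;
}
```

Note that 341 = 11 * 31 passes this check for A = 2 but fails it for A = 3, which is why a single base only gives a probabilistic answer and why the stronger SPRP variant is preferred.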

The test can be changed from a probabilistic rejection into a definitive proof of primality by precomputing certain test input parameters which are known to always succeed for ranges of N. Unfortunately the discovery of these “best known tests” is effectively a search of a huge (in fact infinite) domain. In 1980, a first list of useful tests was created by Carl Pomerance (famous for inventing the Quadratic Sieve algorithm, which was later used to factor RSA-129). Jaeschke improved the results significantly in 1993. In 2004, Zhang and Tang improved the theory and limits of the search domain. Greathouse and Livingstone have released the most recent results so far on the web, at http://math.crg4.com/primes.html, the best found from a huge search domain.
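As an illustration of such a precomputed test, here is a sketch in C++ of a deterministic primality check for 32-bit N. It relies on the published Jaeschke (1993) result that the three bases 2, 7 and 61 are enough for every N below 4,759,123,141, which covers all 32-bit values. These are not the new CUDA-computed parameters, and the function names are assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>

// (base^exp) mod m, using 64-bit intermediates for each product.
static uint32_t powmod(uint32_t base, uint32_t exp, uint32_t m) {
    uint64_t result = 1 % m, b = base % m;
    for (; exp > 0; exp >>= 1) {
        if (exp & 1u) result = (result * b) % m;
        b = (b * b) % m;
    }
    return static_cast<uint32_t>(result);
}

// Strong probable prime (SPRP) test of an odd n > 2 to base a.
// Writes n - 1 = d * 2^s with d odd, then checks a^d == 1 or
// a^(d * 2^r) == n - 1 for some 0 <= r < s (all mod n).
bool is_sprp(uint32_t n, uint32_t a) {
    assert(n > 2 && (n & 1u));
    uint32_t d = n - 1;
    int s = 0;
    while ((d & 1u) == 0) { d >>= 1; ++s; }
    uint64_t x = powmod(a, d, n);
    if (x == 1 || x == n - 1) return true;
    for (int r = 1; r < s; ++r) {
        x = (x * x) % n;
        if (x == n - 1) return true;
    }
    return false;
}

// Deterministic for every 32-bit n, assuming the Jaeschke base set {2, 7, 61}.
bool is_prime_u32(uint32_t n) {
    if (n < 2) return false;
    if ((n & 1u) == 0) return n == 2;
    const uint32_t bases[] = {2, 7, 61};
    for (uint32_t a : bases) {
        if (a % n == 0) continue;  // only happens when n is 7 or 61 itself
        if (!is_sprp(n, a)) return false;
    }
    return true;
}
```

For example, 2047 = 23 * 89 passes the SPRP test for base 2 alone, but the three bases together reject it. Finding base sets that stay correct for larger and larger ranges of N is exactly the search described here.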

But more searching can always improve the best known algorithm. The work is embarrassingly parallel, since it’s mostly an exhaustive search: computing and testing SPRP candidates… quadrillions of them. This has always been done on CPUs.

But here on the CUDA forum, we all know we have massive multiprocessors at our command. Can they be used for modular power computation and therefore SPRP tests?

The answer, of course, is yes. The NVidia hardware is not designed for efficient integer computation (slow integer multiplies, painfully slow divides and mod computes, no native 64-bit math). But even with this design limitation, its multiprocessing bounty makes it incredibly effective. A modern GTX 280 card is about 8X as fast as a modern CPU at this task. That’s not as impressive as many other math speedups we’ve seen, but remember this is using the relatively weak integer capabilities of the hardware.

The previous best known SPRP tests are summarized on the great page at http://primes.utm.edu/prove/prove2_3.html as well as Greathouse’s improvements at http://math.crg4.com/primes.html.

The newest CUDA-computed results have already more than doubled these previous best known limits. This will have a practical benefit for all applications (on any platform or language) that need to determine primality. The search continues even now.

The bottleneck computation in the search is quickly finding values of A^N mod M for 32- or 64-bit inputs. 64-bit integer math is especially useful on a CPU, even for the 32-bit version, because the product of two 32-bit values needs a 64-bit intermediate before it can be reduced mod M. With CUDA you have to work around those limits.
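The post does not say which workaround was used, so the following is only one possible approach, sketched in host-side C++ with hypothetical function names: a shift-and-add modular multiply that never needs more than 32 bits. It trades speed for simplicity, and a tuned kernel would likely do something faster.

```cpp
#include <cassert>
#include <cstdint>

// (a + b) mod m for a, b < m, without overflowing 32 bits.
// a + b could exceed 2^32 - 1, so compare against m - b instead of adding first.
uint32_t addmod32(uint32_t a, uint32_t b, uint32_t m) {
    return (a >= m - b) ? a - (m - b) : a + b;
}

// (a * b) mod m using only 32-bit adds, shifts and compares.
// Processes b one bit at a time, doubling a at every step.
uint32_t mulmod32(uint32_t a, uint32_t b, uint32_t m) {
    assert(m != 0);
    a %= m;
    b %= m;
    uint32_t result = 0;
    while (b > 0) {
        if (b & 1u) result = addmod32(result, a, m);
        a = addmod32(a, a, m);
        b >>= 1;
    }
    return result;
}

// (base^exp) mod m by square-and-multiply on top of mulmod32.
uint32_t powmod32(uint32_t base, uint32_t exp, uint32_t m) {
    assert(m != 0);
    uint32_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1u) result = mulmod32(result, base, m);
        base = mulmod32(base, base, m);
        exp >>= 1;
    }
    return result;
}
```

Each modular multiply costs up to 32 add-and-compare steps here, so the whole exponentiation is roughly 32 times slower than a version with a native 64-bit product. That is the kind of cost the GPU's parallelism has to make up for.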

This work was more of a side diversion, but one that worked surprisingly well. The search could easily be done on a CPU, but the CUDA hardware proved able to handle the problem even more efficiently, despite the fact that integer math isn’t the hardware’s strength. Kudos to the NVidia engineers, both hardware and especially software, who have made an awesome platform!

I probably will not pursue it, but I am now certain that the well known integer factorization algorithms could be done especially efficiently in CUDA on NV hardware. In fact there are algorithms that would extend perfectly to the CUDA hardware and even avoid the weaknesses of the slow integer mults, divides, and mod calls. Like all CUDA apps, the computes can of course be done on the CPU instead, but the GPU’s massive parallelism is perfectly suited for these kinds of enumerative searches.