New number theory results found using CUDA

A small summary of a new application category for CUDA… number theory!

One of the more practical niche problems in number theory has to do with identification of prime numbers. Given N, how can you efficiently determine if it is prime or not? This is not just a thoeretical problem, it may be a real one needed in code, perhaps when you need to dynamically find a prime hash table size within certain ranges. If N is something on the order of 2^30, do you really want to do 30000 division tests to search for any factors? Obviously not.

The common practical solution to this problem is a simple test called an Euler probable prime test, and a more powerful generalization called a Strong Probable Prime (SPRP). This is a test that for an integer N can probabilistically classify it as prime or not, and repeated tests can increase the correctness probability. The slow part of the test itself mostly involves computing a value similar to A^(N-1) modulo N. Anyone implementing RSA public-key encryption variants has used this algorithm. It’s useful both for huge integers (like 512 bits) as well as normal 32 or 64 bit ints.

The test can be changed from a probabilistic rejection into a definitive proof of primality by precomputing certain test input parameters which are known to always succeed for ranges of N. Unfortunately the discovery of these “best known tests” is effectively a search of a huge (in fact infinite) domain. In 1980, a first list of useful tests was created by Carl Pomerance (famous for being the one to factor RSA-129 with his Quadratic Seive algorithm.) Later Jaeschke improved the results significantly in 1993. In 2004, Zhang and Tang improved the theory and limits of the search domain. Greathouse and Livingstone have released the most modern results until now on the web, at, the best results of a huge search domain.

But more searching can always improve the best known algorithm, it’s embarrassingly parallel since it’s mostly an exhaustive search, computing and testing SPRP candidates… quadrillions of them. This has always been done on CPUs.

But here on the CUDA forum, we all know we have massive multiprocessors at our command. Can they be used for modular power computation and therefore SPRP tests?
The answer, of course, is yes. The NVidia hardware is not designed for efficient integer computation (slow integer multiplies, painfully slow divides and mod computes, no 64 bit native math). But even with this design limitation, its multiprocessing bounty makes it incredibly effective. A modern G280 card is about 8X as fast as a modern CPU. That’s not as impressive as many other math speedups we’ve seen, but remember this is using the relatively weak integer capabilities of the hardware.

The previous best known SPRP tests are summarized on the great page at as well as Greathouse’s improvements at

The newest CUDA-computed results have already more than doubled these previous best known limits. This will have a practical benefit for all applications (on any platform or language) that need to determine primality. The search continues even now.

The bottleneck computation in the search is quickly finding values of A^N mod M for 32 or 64 bit inputs. 64 bit integer math is especially useful on a CPU (even for the 32 bit version) but with CUDA you have to work around those limits.

This work was more of a side diversion but one that worked surprisingly well. The search could easily be done on a CPU, but the CUDA hardware proved itself to handle the problem even more efficiently, despite the fact that integer math isn’t the hardware’s strength. Kudos to the NVidia engineers, both hardware and especially software, who have made an awesome platform!

I probably will not pursue it, but I am now certain that the well known integer factorization algorithms could be especially efficiently done in CUDA and NV hardware. In fact there are algorithms that would extend perfectly to the CUDA hardware and even avoid the weaknesses of the slow integer mults, divides, and mod calls. Like all CUDA apps, the computes can of course be one on the CPU instead, but the GPU’s massive parallelism is perfectly suited for these kinds of enumerative searches.

Congratulations! Interesting info, too.

And nevertheless, CUDA (and computer is general) is absolutely useless in number THEORY. You can find resistant preusodo-prime faster? How this advances the theory?

On the other hand, implementation of old algorithms using CUDA is a good work. Feel free to share sources. May be in case of CUDA, number multiplication via Fast Fourier Transform is faster?

Have these results been published anywhere? A report or website?

  • SG

Not yet, since they’re still cooking to improve them even further. But Greathouse and Livingston used the web to share their progress, and I’ll probably follow their lead and publish the results online.

Who is “they”? Is this your work? Some university, company?


very interesting!
I’m doing the same work just now,

“They” is the search processes on my computers. They’re background jobs that keep searching 24/7. Even now my laptop’s keyboard is a noticably warm as it looks for even better SPRP bases to improve the results even further. (Even my laptop’s mobile GPU is faster than the CPU, though only by about 2X)

My main CUDA (and regular CPU/SSE/threading R&D) is all in rendering, monte carlo, random sampling, etc. My number theory is from college… I did several projects then but the state of the art results then were from Crays, way past anything you could do on a workstation’s 20Mhz 68020. This was before it was common to do distributed processing on clusters.

Last month, when I happened to notice the modern best-known SPRP limits, I realized the problem could be attacked with CUDA, and after a quick one-day hack it was even working. Then, as we all know, like many little hack projects, the success led to optimization, deeper design, improvements, research, etc, way past what I expected to work on. I had new code that was fast enough to broke the best-known results, and a few days later, it found yet better limits.

The interesting thing is that the results have practical use, every new improvement means that everyone else’s prime testing (for hash tables or crypto or number theory) is that much faster.

Sorry to revive this dead horse-thread ;)

But what has come out of your research so far? Has anything been published in the mean time? Is any code available now?


Sure, attached is a tech report.

Like all projects, looking back at it I would attack it in a totally different way now… my core math computations could be completely replaced with code that’s probably 5X faster as a guess.

My focus is more of raytracing and particle simulation, but this was a kind of hobbyist side project one week.

I’ll dig up and post the code as well.

worley_SPRPsearch.pdf (47.2 KB)

Thank you for your response. I would actually be very interested in the code. Does it require compute capability 1.3 or would it be able to run also on my armada of CC 1.1 cards?


I clicked through this link and noted this:

“Note (October 9, 2008): My calculations are incorrect and I am withdrawing my claim to an efficient set of bases. See notes by QU Shun Liang (English) for details, as well as a possible replacement set of bases.”

One of their posted 5-set of bases was wrong, oops. ;)


Yeah, I mention that in a footnote. It doesn’t affect their 2, 3, and 4 base results though. Of course those are the results my program found much better values for.

It’s all compute 1.0 stuff.

I’ll post up the code when I get it organized… I can put it in the CUDAzone too.

Of course as I said before now I know how to do it much faster. Christian, you’re trying to distract me with a fun weekend project again, aren’t you?