More number theory using CUDA

My first attempt at learning CUDA is a little unusual: accelerating one of the computationally intensive parts of the general number field sieve algorithm for factoring integers. Most of the interest is either in the sieving or the linear algebra stages, because those take the most time, but my initial work is on the first stage: polynomial selection.

Around this time last year, Thorsten Kleinjung described a new polynomial selection algorithm that improves noticeably on the previous state of the art (which was also his invention); it is described here. I’ve implemented Kleinjung’s improved algorithm in the Msieve library (project page), and it turns out the core of the algorithm is very amenable to a GPU: small dataset, lots of arithmetic, regular memory access pattern, etc.

The good stuff is in branches/msieve-gpu/gnfs/poly/stage1, currently under heavy development. I have CPU and GPU code that performs the same computations, and compared to a 1.86 GHz Core 2 Duo, even an older 9800 GT card performs 3.8x faster. And this in spite of the well-documented problems nvcc has with multiprecision arithmetic.
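For anyone wondering what those problems look like in practice, here’s a minimal sketch (my own illustration, not the actual msieve code) of a multiword add. CUDA C exposes no carry flag, so every carry has to be recovered with comparisons:

    /* Sketch: adding two 96-bit numbers stored as three 32-bit words.
       With no carry-flag intrinsic in CUDA C, each carry out of a word
       must be reconstructed by comparing the truncated sum against an
       operand, which bloats the generated code. */
    typedef struct { unsigned int w[3]; } uint96;

    __device__ uint96 add96(uint96 a, uint96 b)
    {
        uint96 r;
        unsigned int c;
        r.w[0] = a.w[0] + b.w[0];
        c = (r.w[0] < a.w[0]);                            /* carry out of word 0 */
        r.w[1] = a.w[1] + b.w[1] + c;
        c = (r.w[1] < a.w[1]) || (c && r.w[1] == a.w[1]); /* carry out of word 1 */
        r.w[2] = a.w[2] + b.w[2] + c;
        return r;
    }

Every word of a multiword product needs the same treatment, so the cost adds up quickly.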

It’s really impressive how quickly an amateur like me can get up to speed on programming Nvidia’s cards, and this forum has a lot to do with that. Keep the great discussions going; I’m learning a bunch!

Today was the final day of the GPU Technology Conference. Kaiyong Zhao and I found each other and had a fun time talking about small (~2^100), medium (~2^1000), and large (~2^2^20) modular math on the GPU. He presented a poster on medium-sized modular math in the context of secure hashing. Earlier this year I did a little paper on tiny (2^32) modular math for primality proving.
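The tiny case is the one size where everything fits in native types; a minimal sketch (my own, not taken from the paper):

    /* Sketch of tiny (32-bit) modular multiplication via a 64-bit
       intermediate product. Production code would replace the slow
       64-bit '%' with a precomputed reciprocal or Montgomery form. */
    __device__ unsigned int mulmod32(unsigned int a, unsigned int b,
                                     unsigned int m)
    {
        return (unsigned int)(((unsigned long long)a * b) % m);
    }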

The small and large sizes both come up in Mersenne prime testing, which also involves sieving.

There’s a fun thread on simple sieves here on the forum.
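In the spirit of that thread, the simplest data-parallel arrangement puts one thread on each small prime; a sketch (mine, not code from the thread):

    /* Sketch of a simple GPU sieve: one thread per small prime, each
       crossing off that prime's multiples in a composite-flag array.
       The races are benign (every write stores 1), but the load balance
       is poor since small primes do far more work than large ones. */
    __global__ void sieve_kernel(unsigned char *is_composite, unsigned int n,
                                 const unsigned int *primes,
                                 unsigned int num_primes)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_primes)
            return;
        unsigned int p = primes[i];
        for (unsigned long long j = (unsigned long long)p * p; j < n; j += p)
            is_composite[j] = 1;
    }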

You’re not the first number theorist to get fast results from GPUs. Two months ago Daniel Bernstein himself wrote his first CUDA kernel in two days and used a GPU cluster to win a SHA hash matching contest.

There are brand-new “Fermi” cards, announced just two days ago. They’re going to boost double-precision throughput to about 600 GFLOPS (current cards are about 90 DP GFLOPS, if I remember right). This will have a big impact on computational number theory with big integers, which uses Fourier methods for multiplication.

If you visit mersenneforum.org with any frequency, you can’t miss the legions of people wanting Prime95 to run on a graphics card. OTOH, a $100 older-model card is plenty to get started if you don’t need double precision.

(For the record, I study computational number theory as a hobby and not in any professional capacity; the work here was a good excuse to get started with GPU programming while avoiding having to write toys.)

Bernstein and Lange’s papers on GPU ECM are amazing; I’m not remotely in their league. There are signs Fermi will have faster 32-bit integer multiplication, which will instantly invalidate huge numbers of research papers that attempt to avoid it :(

It does indeed have 1-clock 32-bit multiply-adds with carry. 64-bit multiplies are 4 clocks… but it’s unknown whether there’s a free multiply-add or not. 64-bit add speed is also unknown; it’s either 1, 2, or 4 clocks, likely 2 clocks just because of the register throughput.
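The 4-clock figure is plausible if you count the 32-bit pieces a 64-bit product decomposes into; a sketch (my arithmetic, not vendor documentation):

    /* Sketch: the low 64 bits of a 64x64-bit product, built from 32-bit
       halves. Four multiply instructions in all (a0*b0 low and high,
       plus a0*b1 and a1*b0 low), which lines up with a 4-clock estimate
       at one 32-bit multiply-add per clock. */
    __device__ unsigned long long mul64lo(unsigned int a0, unsigned int a1,
                                          unsigned int b0, unsigned int b1)
    {
        unsigned int lo = a0 * b0;
        unsigned int hi = __umulhi(a0, b0) + a0 * b1 + a1 * b0;
        return ((unsigned long long)hi << 32) | lo;
    }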

The NV guys at the conference get annoyed when you start asking about such arcane details, since just about nobody cares about integer math throughput except the number theory freaks.

512 cores * 1.5 GHz (??) / 4 clocks = 192 64-bit GigaIOPS. That is rather unprecedented, especially for a $400 (??) card.

I’m sure all those SSL server farms could drive some sales :)

Thanks for your paper on small modular arithmetic; Kleinjung’s algorithm can be arranged so that the bottleneck is modular multiplication of integers 48-80 bits in size. My code handles 64-bit moduli only right now, which is enough to select polynomials for factoring up to ~180-digit (594-bit) RSA numbers. That’s close to the limit of what NFS postprocessing can fit on one machine anyway.
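For the curious, one standard way to implement that bottleneck operation is Montgomery multiplication; here’s a minimal sketch (not the actual msieve kernel, and it assumes an odd modulus below 2^63 with operands already in Montgomery form):

    /* Sketch of 64-bit Montgomery multiplication (REDC). Assumes an odd
       modulus m < 2^63 and a, b already scaled by 2^64 mod m. The value
       mp = -m^(-1) mod 2^64 is precomputed on the host. */
    __device__ unsigned long long montmul64(unsigned long long a,
                                            unsigned long long b,
                                            unsigned long long m,
                                            unsigned long long mp)
    {
        unsigned long long lo = a * b;             /* low half of a*b */
        unsigned long long hi = __umul64hi(a, b);  /* high half of a*b */
        unsigned long long q  = lo * mp;           /* q*m == -lo (mod 2^64) */
        unsigned long long r  = hi + __umul64hi(q, m) + (lo != 0);
        return (r >= m) ? r - m : r;               /* reduce into [0, m) */
    }

The division by the modulus disappears entirely, which is exactly what you want on hardware with fast multiplies and no native integer divide.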

A GPU can provide enough horsepower to make some security-relevant bits disappear, eh?
Interesting to hear that the NFS is about to be ported onto a data-parallel architecture.
I wrote an elliptic curve discrete logarithm problem solver for the GPU, and the results are promising as well.
( http://mrhaandi.thecoderblogs.com/2009/09/…lp-solver-v03a/ )
You don’t even need high-end hardware to outperform optimized x64 assembler code.

I wonder what Fermi brings (since multiprecision arithmetic is crucial).