Prime95 and CUDA Want to make a CUDA Mersenne Prime Test

[B]Hi! I want to start a code to develop a Mersenne Prime Test
program for 8xxx graphics cards at SLI. Anyone wants to help?
Maybe we can get a boost with all these co-processors.

If you are interested, there is a thread about this:

– Nicolas

Prime95 uses a discrete weighted fourier transform, which is currently implemented using 64 bit floating point arithmetic. From what I understand, Prime95 is currently trying to find 10+ million digit primes, and is actually reaching the limit on the size of the number it can calculate using 64 bit precision (its peak is around 23 million digits.)

From what I understand, 8800’s only operate on 32 bit floats, meaning it could be ported, but the number of digits that could be calculated within that smaller precision would be lowered significantly.

You can implement higher precision arithmetic with lower precision floats. The dsfun90 library is one example of how to do this. (I recently had to port the addition and multiplication subroutines from dsfun90 to CUDA so I could implement an exponential with 10^-12 precision using the 10^-7 precision floats.)

Of course, it isn’t as fast as dedicated silicon would be. Adding two pseudo-double precision numbers using the dsfun90 method takes 11 single precision additions, and multiplication takes somewhere around 25 operations since the G80 can do a single-precision mutliply-add in one instruction. There may be other extended precision representations which are more efficient.

I found this page on the FFT method which programs like Prime95 use to do their multiplication of large numbers. Makes for an interesting read:…rithms/fft.html

Since G80 GPUs also support integer arithmetic (and a full set of bit-wise ops), you can construct arbitrary precision arithmetic out of 32-bit ints. I have no experience with this myself, so I’m not sure if that would be better than using single precision floats, but it seems like it could be.


Being able to multiply and add floats in 2 clock cycles is helpful. It looks like the problem with using 32 bit ints is that it takes 8 clock cycles to do a multiplication. Using the fast 24 bit multiplies would work, though that’s pretty much the float multiply with no exponent.

Yes, good point. ints do support shifts and logic ops, though, in case that helps.


Hm, just wondering - how does the mul24 handle overflows? Wrap around in 24bit like the standard mul does on 32bit?

(Could try myself - just too lazy :whistling: )


They are described in Appendix A of the CUDA manual.

“__mul24(x,y) computes the least significant 32 bits of the product of the least significant 24 bits of x and y”

So it’s a mixed precision operation, I guess. Also interesting is the __mulhi(x,y) operation which gives the most significant 32 bits of the 32 bit x 32 bit multiplication. Does __mulhi() also take 8 clock cycles?

Yeah but that’s not the answer to my question.

So I finally looked it up. The answer is in the following sentence of the cited paragraph. Unfortunately mul24 does not behave like the 32bit mul. If the operands overflow, the result is undefined. So you will have to do a mulhi first, check if overflow occurred and then do the mul24 on the possibly truncated operands.

I assume that mulhi takes 2 clock cycles (the manual doesn’t tell). So I get 2 for mulhi, 2x2 for the & masks and 2 for mul24. This is 8 cycles like the 32bit operation. So I assume it is implemented via these “primitive operations”.

So mul24 without checks is safe on numbers that cannot excess 24 bit. blockIdx, blockDim and threadIdx are such. Hint: address calc for mem access