GPU sieving help: seeking GPU programming help for the sieves at PrimeGrid

My name is John and I help with project coordination at PrimeGrid. We are currently running several sieves, two of which are within the BOINC platform (gcwsieve & sr2sieve). Both were developed by Geoff Reynolds. sr2sieve has by far the most impact, as it is used for most of the projects at PrimeGrid. One of the biggest projects it supports is the Seventeen or Bust/Prime Sierpinski Problem sieve.

I’m currently seeking someone who can help us port these sieves (primarily sr2sieve) to a GPU application.

I quote from Geoff:

We are hoping that GPU sieving can have a substantial impact on the landscape of prime finding, allowing us to sieve much deeper. If there are any programmers who can assist us, please let me know. We at PrimeGrid look forward to any collaboration that may develop.

The source code for sr2sieve can be found here:

The source for the other programs that Geoff mentions can be found here:





AP26 was developed by Jaroslaw Wroblewski and adapted to BOINC by Geoff.

fsieve and psieve were developed by Mark Rodenkirch with support from Geoff.

I can't offer any solid performance numbers, but you seem to be aiming straight for the GPU's weaknesses!
The GPU likes to work with floating-point numbers.
Firstly, I have no idea how an integer > 32 bits would be treated. I think some people here were working on a bigint class; I'm not sure what happened to it.
Secondly, the modulo operation is costly, unless p is a power of 2, in which case the modulo can be rewritten as a bitwise AND (mask), which is among the fastest operations on the GPU.
Thirdly, that is a fairly low computation-to-memory-transfer ratio. The strength of the GPU is crunching numbers, not reading from global memory.
Lastly, an array of 10 000 is relatively small for the GPU. The GPU keeps busy by having tens of thousands, even millions, of threads active.

Note that I'm not saying it would not be faster than a CPU implementation (I'm pretty sure it would be), just that it's not the ideal CUDA scenario.

I've gathered some quick data for you:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cutil.h>

__global__ void modmultcuda(int *inout, int p, int b, int num)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;

	if (idx < num)
	{
		int val = inout[idx];
		inout[idx] = (val*b) % p;
	}
}

void modmult(int *inout, int p, int b, int num)
{
	for (int i = 0; i < num; i++)
		inout[i] = (inout[i]*b) % p;
}

int main(int argc, char** argv)
{
	int num = 1000000;
	int *inout, *d_inout;

	inout = (int*)malloc(num*sizeof(int));
	cudaMalloc((void**) &d_inout, sizeof(int) * num);

	for (int i = 0; i < num; i++)
		inout[i] = rand();

	cudaError_t err;
	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer( &timer));
	CUT_SAFE_CALL( cutResetTimer( timer));
	CUT_SAFE_CALL( cutStartTimer( timer));

	cudaMemcpy(d_inout, inout, sizeof(int)*num, cudaMemcpyHostToDevice);

	const dim3 dimBlock(64);
	int dim = (int)ceil((float)num/64.0f);
	const dim3 dimGrid(dim);

	modmultcuda<<<dimGrid, dimBlock, 0>>>(d_inout, 135, 22, num);

	err = cudaThreadSynchronize();
	if (cudaSuccess != err)
		fprintf(stderr, "Cuda error: in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString(err));

	cudaMemcpy(inout, d_inout, sizeof(int) * num, cudaMemcpyDeviceToHost);

	CUT_SAFE_CALL( cutStopTimer( timer));
	printf("Processing time GPU: %f (ms)\n", cutGetTimerValue( timer));

	CUT_SAFE_CALL( cutResetTimer( timer));
	CUT_SAFE_CALL( cutStartTimer( timer));

	modmult(inout, 135, 22, num);

	CUT_SAFE_CALL( cutStopTimer( timer));
	printf("Processing time CPU: %f (ms)\n", cutGetTimerValue( timer));

	return 0;
}


On an 8800GT vs. a Core 2 Duo E6550 (the app is single-threaded), times in ms:

num = 10 000: CPU: 0.08, GPU: 0.10

num = 100 000: CPU: 0.87, GPU: 0.52

num = 1 000 000: CPU: 8.58, GPU: 3.07

Coded in 15 minutes, so it's all about quality!

Guess you can see where the GPU would shine… 10 000 just isn't enough parallelism.

It would also help if there were more work to perform on each variable.

Allieur, you’re awesome for jumping in to help and give those guys fast feedback!

Do you really mean 2^51? Or 2^31?

JM: one very weird part of GPU programming is that your bottlenecks are always in wildly different places than you’d expect.

It’s also usually difficult to try to mix CPU and GPU steps… there’s a distant, narrow, and high-latency pipe connecting the two.

In fact, you can often analyze potential compute approaches just by using quick estimates.

If the CPU can compute a value (say 32 bits? 64 bits? depends on whether you mean 2^51 or 2^31) in 5 clocks, let’s assume that the GPU can do it for free! 0 clocks! As many as you like.

If so, our only bottleneck is getting the answers to and from the GPU. How long does it take to send 1M values to and from the GPU?

Well, PCIe v2 x16 is our best case, and we might hope for 4 GB/sec from the pipe. 1M 64-bit values means 8 MB of data; it needs to pass through the pipe twice, so we’re going to use 4 ms.

4 ms is about 12,000,000 CPU clock ticks, and with your measurement of 6 clock ticks per value, that means the CPU could compute 2M values in the same time.

So for this one quick test, the CPU is faster than just the transfer time to and from the GPU.

A more promising approach is to run your entire algorithm on the GPU. Then there are no PCIe bottlenecks.

Then we get into the question of making a GPU algorithm, which would likely be structured very unlike a CPU version.

This is not a minor task. Fun, but not anything like a recompile or even a port. New algorithms are needed.

Integer math is indeed a (relative) GPU weakness, but only compared to the GPU’s outrageous floating point power.

I have done modular integer power computations on the GPU. For values < 2^32 I could easily beat the CPU, but for larger sizes like 2^40 it was trickier, since a GPU's 64-bit integer math has only 1/8 the throughput of 32-bit. (And there are some important boosts for sub-24-bit values too.)

A trick I’ve used twice to avoid running the entire algorithm on the GPU is to push down another for-loop to the GPU. If computing one vector modular multiplication operation underutilizes the card, look one level up and see if you can push N modular multiplication operations onto the card, one per block.

This may not be applicable if the point of these modular operations is to exponentiate an integer, so you might have to look higher up and exponentiate N integers, one per block.

Hi, I am the Geoff that John quoted above.

Yes it is 2^51.

To get an idea of the work involved: typically sieving for factors up to 2^31 requires a few minutes or a few hours on a single machine, so it is not worth writing special code to take advantage of these small numbers. Sieving for factors up to 2^51 can take hundreds or even thousands of CPU years. In some cases PrimeGrid has already passed 2^51, and it would be useful to have code that works for factors up to about 2^60.

There are a number of ways to implement modular multiplication by a modulus in this range. On the Core 2 the fastest seems to be a mix of 64x64->64-bit integer multiplications and double-precision (limited to 2^52) or extended-precision (80-bit long doubles, limited to 2^62) floating-point operations. But a method (I think it is called a Newton/Barrett method) using 64x64->128-bit multiplications is almost as fast, and is limited to 2^63. There is also Montgomery multiplication using 64x64->128-bit multiplication; this is about half the speed of the other methods, but works up to 2^64.

The methods using integer-only multiplication can be implemented using 32x32->64-bit multiplication, but would need about 4 times as many operations. If only 32x32->32-bit multiplication is available then it would need about 8-10 times more operations per modular multiplication. If only single-precision floating-point operations are used then it would need about 20 times as many operations, but I have no idea how to implement it that way, so this is just a guess.

Most of the sieving algorithms look like this from the top:

Let p0,p1 be integers up to 2^51 (or better 2^56, etc.)

For each prime p in the range p0 <= p < p1


  Perform thousands or millions of multiplications modulo p,

  In the very rare case that this leads to a factor being discovered,

	Write the factor to file.


There are many methods of generating the primes p; the fastest ones usually use a lot of memory, but there are methods that use very little memory and a lot of computation. In many cases it is not necessary for p to actually be prime, just to have no small factors (but some algorithms do require that p is really prime). For most of the projects that PrimeGrid is currently involved in, the time spent generating the primes p is very small (< 2%) compared to the time spent in the body of the loop.

If more parallelism is needed then it is quite OK to process any number of primes at once; they don’t have to be processed in order, as long as the whole range is covered.

The result of the computation for each prime p is usually just a boolean: yes, this p is a factor, or no, it is not. Because factors are rare, it is no problem to repeat the whole computation if one is found. In fact this is done routinely, using a different algorithm, as a double-check. So possibly the primes could be generated by the CPU and fed to the GPU, which reports back only those which are factors, or else the whole algorithm could run on the GPU, I don’t know.

I realize that an efficient GPU implementation might be very different to the way it is done on a CPU, but some of the algorithms are very simple and so there might be an obvious way to proceed. The simplest is probably the factorial sieve, which looks like this:

Let n1 be an integer about 1,000,000 - 10,000,000 (?)

Let p0,p1 be integers about 2^45 - 2^50 (?) (higher is better)

For each prime p in p0 <= p < p1


	x = 2;

	For n from 1 to n1


	  x <-- x*n mod p

	  if x = 1 (mod p)

		Report that p is a factor of n!-1

	  else if x = -1 (mod p)

		Report that p is a factor of n!+1



The actual numbers marked (?) I am just guessing, John can supply accurate figures if necessary. Please ask for any details that would help, but don’t expect immediate replies as I am only online a few times per week.

Sorry, there was a mistake in the factorial sieve algorithm posted above; here is a correction. For all these algorithms I assume that the operation r = a*b mod p produces a result in the range 0 <= r < p:

Let n1 be an integer up to about 2^20 < n1 < 2^24 (?)

Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)

For each prime p in p0 <= p < p1


  x <-- 1;

  For n from 2 to n1


	x <-- x*n mod p

	if x = 1

	  Report that p is a factor of n!-1

	else if x = p-1

	  Report that p is a factor of n!+1



Here is the primorial sieve algorithm:

Let n be a prime up to about 2^20 < n < 2^24 (?)

Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)

Let Q be a list of the primes {2, 3, ..., n}

For each prime p in p0 <= p < p1


  x <-- 1;

  For each q in Q from first to last


	x <-- x*q mod p

	if x = 1

	  Report that p is a factor of q#-1

	else if x = p-1

	  Report that p is a factor of q#+1



Here is simplified algorithm for the Generalised Cullen sieve (Woodall is similar):

Let N be an ordered list of integers {n0, ..., nm}, where nm is about 2^20.

Let d be the greatest difference between successive elements of N.

Let D be a table with room for d+1 integers.

Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)

Let b be a small fixed integer (usually 2, but sometimes up to 1000 or so).

For each prime p in p0 <= p < p1


  D[0] <-- 1

  For i from 1 to d

	D[i] <-- b*D[i-1] mod p  /* so D[i] = b^i mod p */

  x <-- b^{-nm} mod p

  if x = nm (mod p)

	Report that p is a factor of nm*b^nm+1

  For j from m-1 down to 0


	x <-- x*D[n{j+1}-nj] mod p

	if x = nj (mod p)

	  Report that p is a factor of nj*b^nj+1



My implementation processes the loop “For j from m-1 down to 0” using four windows on N to achieve the necessary parallelism, but it could be done by unrolling the top-level loop instead.

Here is a rough algorithm for sr2sieve (Sierpinski sequences k*b^n+1 only; Riesel sequences k*b^n-1 and dual sequences b^n+/-k are similar).

Let K be a list of integers (the k in k*b^n+1, usually <2^16)

For each k in K, let Nk be a list of integers (the n in k*b^n+1, usually <2^24)

Let n0,n1 be the minimum,maximum elements in the union of all Nk

Let b be a small integer (the base b in k*b^n+1, often b=2)

Let p0,p1 be integers (up to 2^60 may become necessary)

Let H be a hashtable

Choose a good even integer value Q. Q=360 is currently used for the PSP sieve.

Choose a set S of prime powers. Current implementation uses S={2,3,4,5,8,9,16}

All other lowercase symbols are integers, uppercase symbols lists of integers unless otherwise stated.

For each prime p in p0 <= p <= p1


  F[0] <-- 1

  For i from 1 to Q-1

	F[i] <-- F[i-1]*b mod p  /* so that F[i] = b^i mod p */

/* Power residue tests */

  Let R be a maximal subset of S such that p = 1 (mod r) for all r in R

  c <-- 0

  For each k in K

	For each congruence class d modulo Q represented in Nk

	  If for any r in R, F[d] is an r-th power residue with respect to p


		C[c] <-- k

		D[c] <-- d

		E[c] <-- -k*F[d] mod p

		c <-- c+1


/* Choose m,n so that m*n >= (n1-n0)/Q */

  m <-- sqrt(c*(n1-n0)/Q)

  n <-- ceil((n1-n0)/Q/m)

/* m baby steps */

  Empty H

  w <-- b^{-Q} mod p

  x <-- b^{-floor(n0/Q)} mod p

  Insert (0,x) into H

  For h from 1 to m


	x <-- x*w mod p

	Insert (h,x) into H


/* n giant steps */

  y <-- b^{Q*m} mod p

  For i from 0 to n-1

	For j from 0 to c-1


	  For all z such that (z,E[j]) is in H

		Report that p is a factor of C[j]*b^{n0+Q*(i*m+z)+D[j]}+1

	  E[j] <-- E[j]*y mod p



Although it may not be obvious, almost all of the modular multiplications in this algorithm can be arranged as vector operations on fairly large arrays (of length m for the baby steps, length c for the giant steps), so there is more than enough parallelism to exploit on x86 CPUs. But unlike the other sieves it is difficult to process multiple primes p in parallel, because each prime will result in different combinations of sequences passing the power residue tests, so the threads will not stay in sync but must be individually scheduled.

The hashtable operations require many branches, which may be a problem on a GPU; using a large enough hashtable makes most of the branches fairly predictable but increases cache misses. Insert operations cannot be done in parallel (at least I don’t know how to do them in parallel), but lookups can. For most of PrimeGrid’s projects a hashtable of up to 64KB (2^15 16-bit keys) is adequate. If necessary the hashtable could be replaced by a flat array, with a linear search used instead of a hashed lookup. That would probably be much slower on most CPUs, but would make it possible to replace branches with conditional moves.

That is some dense algorithmics…

I tried to read through it all, I really did!

I think it would be much easier for you to learn CUDA than for me to try to understand any of this! :)

Check out the SDK and the kind of problems that get solved with CUDA, and see if it’s a good fit for you…
From what I understood of those last two chunks of code, you seem to require massive arrays and massive lookups, which is not ideal…

Sorry I don’t have much advice to offer. All I can say is that porting C code to CUDA is “easy” and shouldn’t take you too long, once you’ve understood the basics.

FYI, for the above sieves, the following n and p should suffice:

factorial: our current sieve is for n<1M so n<2^20 is fine;

expected sieve depth should be well below p<2^50

primorial: our current sieve is for n<10M so n<2^24 is fine;

expected sieve depth should be well below p<2^50

gcwsieve: our current sieve is for 10M<n<25M so n<2^25 is fine;

expected sieve depth unknown but should be below p<2^50

sr2sieve: several sieves… the largest is for n<50M, so n<2^26 is fine;

expected sieve depth unknown… short term p<2^57 is fine; long term p<2^60 is fine