Sorry there was a mistake in the factorial sieve algorithm posted above, here is a correction. For all these algorithms I assume that the operation r = a*b mod p produces a result in the range 0 <= r < p:
Let n1 be an integer up to about 2^20 < n1 < 2^24 (?)
Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)
For each prime p in p0 <= p < p1
{
x <-- 1;
For n from 2 to n1
{
x <-- x*n mod p
if x = 1
Report that p is a factor of n!-1
else if x = p-1
Report that p is a factor of n!+1
}
}
Here is the primorial sieve algorithm:
Let n be a prime up to about 2^20 < n < 2^24 (?)
Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)
Let Q be a list of the primes {2, 3, ..., n}
For each prime p in p0 <= p < p1
{
x <-- 1;
For each q in Q from first to last
{
x <-- x*q mod p
if x = 1
Report that p is a factor of q#-1
else if x = p-1
Report that p is a factor of q#+1
}
}
Here is simplified algorithm for the Generalised Cullen sieve (Woodall is similar):
Let N be an ordered list of integers {n0, ..., nm}, where nm is about 2^20.
Let d be the greatest difference between successive elements of N.
Let D be a table with room for d+1 integers.
Let p0,p1 be integers in roughly 2^45 < p0 < p1 < 2^50 (?)
Let b be a small fixed integer (usually 2, but sometimes up to 1000 or so).
For each prime p in p0 <= p < p1
{
D[0] <-- 1
For i from 1 to d
D[i] <-- b*D[i-1] mod p /* so D[i] = b^i mod p */
x <-- b^{-nm} mod p
if x = nm (mod p)
Report that p is a factor of nm*b^nm+1
For j from m-1 down to 0
{
x <-- x*D[n{j+1}-nj] mod p
if x = nj (mod p)
Report that p is a factor of nj*b^nj+1
}
}
My implementation processes the loop “For i from m-1 down to 0” using four windows on N to achieve the necessary parallelism, but it could be done by unrolling the top-level loop instead.
Here is a rough algorithm for sr2seive (Sierpinski sequences k2^n+1 only, Riesel sequences kb^n-1 and Dual sequences b^n+/-k are similar).
Let K be a list of integers (the k in k*b^n+1, usually <2^16)
For each k in K, let Nk be a list of integers (the n in k*b^n+1, usually <2^24)
Let n0,n1 be the minimum,maximum elements in the union of all Nk
Let b be a small integer (the base b in k*b^n+1, often b=2)
Let p0,p1 be integers (up to 2^60 may become necessary)
Let H be a hashtable
Choose a good even integer value Q. Q=360 is currently used for the PSP sieve.
Choose a set S of prime powers. Current implementation uses S={2,3,4,5,8,9,16}
All other lowercase symbols are integers, uppercase symbols lists of integers unless otherwise stated.
For each prime p in p0 <= p <= p1
{
F[0] <-- 1
For i from 1 to Q-1
F[i] <-- F[i]*b mod p /* so that F[i] = b^i mod p */
/* Power residue tests */
Let R be a maximal subset of S such that p = 1 (mod r) for all r in R
c <-- 0
For each k in K
For each congruence class d modulo Q represented in Nk
If for any r in R, F[d] is an r-th power residue with respect to p
{
C[c] <-- k
D[c] <-- d
E[c] <-- -k*F[d] mod p
c <-- c+1
}
/* Choose m,n so that m*n >= (n1-n0)/Q */
m <-- sqrt(c*(n1-n0)/Q)
n <-- ceil((n1-n0)/Q/m)
/* m baby steps */
Empty H
w <-- b^{-Q} mod p
x <-- b^{-floor(n0/Q)} mod p
Insert (0,x) into H
For h from 1 to m
{
x <-- x*w mod p
Insert (h,x) into H
}
/* n giant steps */
y <-- b^Q^m mod p
For i from 0 to n-1
For j from 0 to c-1
{
For all z such that (z,E[j]) is in H
Report that p is a factor of C[j]*b^{n0+Q*(i*m+z)+D[j]}+1
E[j] <-- E[j]*y mod p
}
}
Although it is may not be obvious, almost all of the modular multiplications in this algorithm can be arranged to be done as vector operations on fairly large arrays (of length m for the baby steps, length c for the giant steps), so there is more than enough parallelism to exploit on x86 CPUs. But unlike the other sieves it is difficult to process multiple primes p in parallel because each prime will result in different combinations of sequences passing the power residue tests, and so the threads will not stay in sync but must be individually scheduled.
The hashtable operations require many branches which may be a problem on a GPU, using a large enough hashtable makes most of the branches fairly predictable but increases cache misses. Insert operations cannot be done in parallel (at least I don’t know how to do them in parallel), but lookups can. For most of PrimeGrid’s projects a hashtable of up to 64Kb (2^15 16-bit keys) is adequate. If necessary the hashtable could be replaced by a flat array and a linear search operation used instead of a hashed lookup. That would probably be much slower on most CPUs, but would make it possible to replace branches with conditional moves.