Sieve of Eratosthenes

Hi,

Just wondering if anyone has implemented a prime sieve, such as the Sieve of Eratosthenes, in CUDA. If you have, you’re invited to share your particular parallelization approach here.

I know that Eratosthenes is not the most efficient prime generating algorithm, but this is about learning CUDA, not necessarily about algorithmic optimality.

We’re mostly interested in how many primes there are in a given numeric interval, not necessarily an actual list of prime numbers. This should greatly relax the memory and memory bandwidth requirements (storing an actual list would consume gigabytes of memory and/or hard drive space, and it would also make the kernel memory bound instead of compute bound, which is not what we are after). And for the sake of simplicity we will work with 64 bit integer numbers only.

We’re definitely going to document our findings, successes and failures in this thread.

Christian

EDIT: corrected spelling of the greek dude in posting body

Quite surprisingly, I designed a prime sieve for CUDA last week! It’s all just notes and not code, though.

The actual application I was considering was for Mersenne prime trial division. In this application, it’s a waste of work to divide by composites.

That’s not quite the same as counting primes.

You’re also wrong that sieves are not efficient… they are actually the most efficient technique for making lists of primes or counting them.

The classic Sieve of Eratosthenes isn’t optimal, but (many) variations exist for different applications which are. In practice the Sieve of Atkin is a good balance between efficiency and ease of implementation. (And amusingly, it was co-designed by Daniel Bernstein, who was the author of the CUDA code for CodingCrypto’s team for the SHA1 hash contest last month.)

Sieves are efficient simply because you can mark composites with a tight, tight loop racing through memory. That tight loop for a sieve might be something as tiny as

while (i<big) V[i+=step]=1;

In practice you may use bitmasks instead of byte addressing.
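Spelled out (a host-side sketch with made-up names, not anyone’s actual kernel), the byte-wise inner loop and its bitmask variant look like this:

```cpp
#include <vector>
#include <cstddef>

// The tight marking loop from above, byte-wise: mark every multiple
// of 'step' starting at 'start' as composite.
void mark_bytes(std::vector<unsigned char>& V, std::size_t start, std::size_t step) {
    for (std::size_t i = start; i < V.size(); i += step)
        V[i] = 1;
}

// Same idea with bit addressing: one bit per candidate, 1/8 the memory,
// at the cost of a read-modify-write per mark.
void mark_bits(std::vector<unsigned char>& B, std::size_t nbits,
               std::size_t start, std::size_t step) {
    for (std::size_t i = start; i < nbits; i += step)
        B[i >> 3] |= (unsigned char)(1u << (i & 7));
}
```

Note that the read-modify-write in the bit version is exactly the kind of operation that raises the race questions discussed further down in this thread.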

The main issue when dealing with a sieve in CUDA is memory. (It’s always memory for all CUDA algorithms!)

The problem with CUDA is that you can only have a block with about 16K of shared memory, so even if you use bit addressing, you still only get “runs” of 16K*8=128K values.

This means it’s easy to sieve with primes up to about 32K or so… higher than that and you’re spending more time setting up your sieve step and less time actually sieving.

(a prime near 32K will only be used to mark 3 or 4 values in a 128K run!)

So my rough design was to abuse CUDA’s global memory. Just use threads and write at large strides. This should immediately make you cringe, since you’ll be doing an entire memory transaction of at least 32 bytes to set a single byte (or bit!), so it’s really wasteful. But the GPU has a ton of bandwidth, so in practice you’d actually get better performance than the CPU. Compared to the GPU, the CPU gets a great speed boost from its L2 cache for primes of 32K to 8M. But after that, it’s got to use system memory and you’re down to only 5-20 GB/sec of bandwidth, compared to the GPU’s 150 GB/sec.

Now this works really well for my trial-divisor problem because trial division is entirely computationally limited and uses no bandwidth. But the sieve is bandwidth limited and uses no computation. So my plan is to interleave them and do both simultaneously in the same inner loop!

If you’re looking to count primes, the sieving plan is roughly the same. I’d start with those very same global memory writes.

Now you do get into an interesting question… are global memory writes byte addressable in single transactions? That is, if I write byte[17] and a DIFFERENT thread, maybe from a different block, writes byte[16], is it guaranteed that both single bytes get written? If the memory is written in chunks, it’s possible that the memory gets READ, a byte masked into one word, and that modified WORD is written back. In that case you may have a memory write race. But is this true on current GPUs? We don’t know; it’s certainly not documented by the CUDA programming guide.

Thinking about it, we could (and probably should) use global atomics. This is in general terrible advice, but the reason that global atomics are inefficient is that they’re serialized and uncoalesced. But our one-bit writes scattered throughout memory are already single uncoalesced writes, so perhaps there’s no further inefficiency to using atomics anyway! The advantage of atomics is that you’re guaranteed not to have any memory transaction races like I worried about before. (And you can clearly use bit addressing to save memory… a big win. You’d use atomicOr to mask in bits.)
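A CPU-side analogue of that plan, with std::atomic’s fetch_or standing in for the GPU’s atomicOr on 32-bit words (a sketch to show the bit-marking idea, not device code):

```cpp
#include <atomic>
#include <vector>
#include <cstddef>

// One bit per candidate; composites are marked with an atomic OR so
// concurrent read-modify-writes to the same word cannot drop bits.
// On the GPU this would be atomicOr on unsigned ints in global memory.
void mark_multiples_atomic(std::vector<std::atomic<unsigned int>>& table,
                           std::size_t nbits, std::size_t prime) {
    for (std::size_t i = prime * prime; i < nbits; i += prime)
        table[i >> 5].fetch_or(1u << (i & 31), std::memory_order_relaxed);
}
```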

Christian, I like your quick “learning CUDA” examples. Even simple clear goals often find fun corners to explore both the algorithms and engineering issues of GPU programming.

Simple problems are often deceptively complex… “How to sort a list of numbers” is a topic entire lifetimes could be devoted to. (and have been!)

-Steve

Memory requirement for bit array is 1250001 bytes.

grid size: 1,1,1

block size: 256,1,1

CPU says there are 664579 primes up to and including 10000000.

GPU says there are 664579 primes up to and including 10000000.

Processing time (GPU): 2201.823975 (ms)

Processing time (CPU): 78.069000 (ms)

Oh no! The CPU is running circles around my GPU. It seems the task is a bit more difficult than anticipated.

My use of global atomics is a real speed killer. But if I don’t use atomics, the threads will step on each other’s toes.

And I do not yet know how to split up the problem into several thread blocks because I currently communicate any found prime via shared memory, before the “next round” of marking multiples can start.

Well, at least the answer is correct ;) Let’s see if my intern can beat me in this little challenge.

Christian

Yep, as I mentioned in my post, the CPU has a big advantage for ranges that fit into its L2 cache… it doesn’t need to go off chip at all. But as the range increases, it will need to and I bet it slows down and the GPU will pull ahead.

You’d need ranges significantly larger than the L2 cache size though… or L3 cache for that matter if your CPU has one. So if you’re doing bit-wise tables on a CPU with a 4 MB cache, you’d see slowdowns in CPU speed only for tables of significantly more than 4M*8 = 32M values… so probably a few hundred million or more. [of course if you were making a CPU version you’d probably just use a multipass algorithm to always stay in cache… I’m sure it’d be faster to do 10 tables of 100M than one table of 1B.]

An interesting question is whether a normal GPU memory transaction is much faster or slower than an atomic memory transaction. That is, if you write ONE BYTE into global memory, you of course waste much of your possible bandwidth, but is that 1-byte write the same speed as an atomic? Twice as fast? Even slower?

As I was hypothesizing in my previous post, if a normal write of just one byte is faster than an atomic, you could reorganize the global memory table to use byte-wise marking. That sounds like there’s thread races, but there may not be since you’re never READING the results until the next kernel launch to check the result count. And if two threads (even from different blocks) write the same byte at the same time, that’s fine too since we are just setting the byte and don’t care which order the threads redundantly set the byte.

But if bytes aren’t the fundamental memory quantum, then you’d likely have to set whole words as your data element size. No speed difference but a big waste of memory.

I’m curious now; I’ve thought and typed about this problem more than I’ve actually coded. I should do some quickie tests too… it will be interesting to see how the different options behave.

Thanks for chipping in, your ideas will be helpful. But I am not sure if going to byte-wise marking would solve the race issues. Maybe memory transactions are all at least 32 bits wide. Only experimentation will show this.

I’ve just done a shared memory version that correctly computes the 11312 primes in the range 0…120119 using a single thread block of 256 threads. Why this strange number, you might ask… I precompute the bit pattern for multiples of the smallest primes, so I get a periodically repeating pattern that consumes 15015 bytes. This snugly fits into shared memory and covers a numeric range of exactly 120120 integers.

Now I only need to mark multiples of the primes 11 and above within the thread block, and also all writes to the bit pattern are spaced enough to be guaranteed to affect different bytes. The different threads belonging to the same half warp won’t step on each other’s toes now.

grid size: 1,1,1

block size: 256,1,1

Processing time (GPU): 0.372000 (ms)

Processing time (CPU): 0.752000 (ms)

CPU says there are 11312 primes up to and including 120119.

GPU says there are 11312 primes up to and including 120119.

This is not yet optimized for bank conflict avoidance, but for the first time the GPU kernel is faster than my similar non-parallel CPU version. The pre-computation of the bit pattern is not included in the GPU time (I might as well store the pattern in the C source code, no time required for pre-computation).
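The count is easy to cross-check on the host with a plain bit-array sieve (a minimal sketch, independent of the CUDA version):

```cpp
#include <vector>
#include <cstddef>

// Count primes in [0, N] with a straightforward bit-array Sieve of
// Eratosthenes; a 0 bit means "unmarked", i.e. prime.
std::size_t count_primes(std::size_t N) {
    std::vector<unsigned char> bits((N + 8) / 8, 0);
    auto marked = [&](std::size_t i) { return (bits[i >> 3] >> (i & 7)) & 1; };
    for (std::size_t p = 2; p * p <= N; ++p)
        if (!marked(p))
            for (std::size_t i = p * p; i <= N; i += p)
                bits[i >> 3] |= (unsigned char)(1u << (i & 7));
    std::size_t count = 0;
    for (std::size_t i = 2; i <= N; ++i)
        if (!marked(i)) ++count;
    return count;
}
```

The same routine also reproduces the 664579 primes up to 10000000 reported earlier in the thread.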

Based on the primes I have determined until now, I could start launching a grid of many thread blocks that EACH sieve a range of 120120 integers, offset by 120120*blockIdx each. This will start getting inefficient when the stride becomes so large that only a few writes are made to shared memory for each prime (parallelism goes bust). So after that point, using global memory instead of shared memory will likely become faster.

Christian

You’ve reinvented the great idea of wheel factorization. And it’s definitely the right way to proceed… you can use shared memory to quickly mark off multiples of primes less than say 10,000, then use global memory to mark off larger regions, and combine the two results. This is very efficient since it minimizes the (many) writes to global memory for small primes… those are all done in shared memory.
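The per-block scheme can be sketched on the host like this (hypothetical names; each call plays the role of one thread block sieving its own window against a shared list of base primes):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Sieve the segment [K, K+S) using the given base primes (all primes up
// to sqrt(K+S)), the way each thread block handles its own window in the
// scheme above. Returns marked[] where marked[j] != 0 means K+j is composite.
std::vector<unsigned char> sieve_segment(std::uint64_t K, std::uint64_t S,
                                         const std::vector<std::uint64_t>& primes) {
    std::vector<unsigned char> marked(S, 0);
    for (std::uint64_t p : primes) {
        // First multiple of p that is >= K and >= p*p (smaller multiples
        // were already handled when sieving with smaller primes).
        std::uint64_t start = p * p;
        if (start < K) start = ((K + p - 1) / p) * p;
        for (std::uint64_t i = start; i < K + S; i += p)
            marked[i - K] = 1;
    }
    return marked;
}
```

The base primes have to come from somewhere, e.g. a first “solo” pass over the lowest segment.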

There’s even a third stage you could apply if you really get fancy, which is to do multiple sieves with different residues mod, say, 2*3*5*7 = 210. This gives primes out of order, which doesn’t matter for your application. The big advantage is that each sieve uses 1/210th of the memory… and that means you can use primes 210 times larger in shared memory.

To see how this works, realize that your entire table has every even number except 2 marked. So why bother marking them all, or even using the memory to store them? We just store a table of 3, 5, 7, 9, 11, 13, 15 and so on and mark those.

Sieving works just fine even on this table with the even numbers compacted. You can then extend the idea and realize that every prime is either 1 mod 6 or 5 mod 6… so you can use a table 1/6 the size and just run the sieve twice, once for 1, 7, 13, 19, 25… and once for 5, 11, 17, 23, 29…

Extending this up to mod 210 squeezes a lot of this easy efficiency out.
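A small sketch of the 1-mod-6 / 5-mod-6 version of this idea (hypothetical code; the same trick extends to larger wheels): keep one compact table per residue class and never store multiples of 2 and 3 at all.

```cpp
#include <vector>
#include <cstddef>

// Count primes <= N with a mod-6 wheel: every prime > 3 is 1 or 5 mod 6,
// so we keep one table per residue class (~N/6 entries each).
std::size_t count_primes_mod6(std::size_t N) {
    if (N < 2) return 0;
    if (N < 3) return 1;            // just 2
    if (N < 5) return 2;            // 2 and 3
    std::size_t K = N / 6 + 1;
    // comp[0][k] covers 6k+1, comp[1][k] covers 6k+5; nonzero = composite
    std::vector<unsigned char> comp[2] = {
        std::vector<unsigned char>(K, 0), std::vector<unsigned char>(K, 0)};
    comp[0][0] = 1;                 // 1 is not prime
    for (std::size_t p = 5; p * p <= N; ++p) {
        std::size_t r = p % 6;
        if (r != 1 && r != 5) continue;            // p has a factor 2 or 3
        if (comp[r == 5 ? 1 : 0][p / 6]) continue; // p itself is composite
        // mark odd multiples of p from p*p; those divisible by 3 simply
        // fall outside both residue classes and are ignored
        for (std::size_t n = p * p; n <= N; n += 2 * p) {
            std::size_t m = n % 6;
            if (m == 1) comp[0][n / 6] = 1;
            else if (m == 5) comp[1][n / 6] = 1;
        }
    }
    std::size_t count = 2;          // 2 and 3
    for (std::size_t k = 0; k < K; ++k) {
        if (!comp[0][k] && 6 * k + 1 <= N) ++count;
        if (!comp[1][k] && 6 * k + 5 <= N) ++count;
    }
    return count;
}
```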

Of course now you’re getting crazy with three algorithms mixing… by that point you’d likely be using a more fancy sieve like Atkin’s.

OK, I whipped up a quick test to see just what was going on.

It tests six different global memory write strategies for correctness and speed.

As expected, these really scattered writes are pretty close to the same speed as an atomic write. The atomic is only about 15% slower.

I was hoping that atomic OR would have the same speed as atomic exchange… this would potentially be true if the memory controller had some extra smarts.

But no, the timings show that atomic OR is half-speed, which means it’s likely loading the word, masking it, then writing it.

The good news is that writing a single byte does seem to be an independent transaction! That is, writing one byte at a time does not corrupt nearby bytes which may also be being written to. And surprisingly, writing a byte is 10% faster than writing a word! I have no idea why that would be true.

Speed results from my 1.3 device, a QuadroFX 4800:

Testing Sieve tables up to N=10000000

Atomic Set found 664579 in 823.561 ms.

Atomic Or found 664579 in 1598.03 ms.

Write Word found 664579 in 691.239 ms.

Write Byte found 664579 in 622.04 ms.

Write Bit found 665070 in 2932.52 ms.

Read/Write Bit found 665067 in 2936.08 ms.

The last two confirm that global[index] |= mask is implemented as a read, mask, then rewrite. And completely as expected, that can have thread races and therefore incorrect results.

A compute 1.1 device may have very different behavior… the memory controller is obviously redesigned for the new coalescing flexibility.

Code attached, it should be easy to extend with your own strategies.

Ignore the absolute times of the runs, this was designed to be a memory speed and behavior test, not a “sieve as fast as possible” test.

Thanks a lot, these results are interesting.

I now have a GPU version that is slightly faster than a single core of my P8400 Intel Core2 Duo when sieving up to 200 million integers on my laptop’s 9600M GT graphics chip. Above 200 million, the CPU becomes faster - and soon after that the watchdog timer ends the show for CUDA anyway.

This took me 8-9 hours and some reboots because I managed to completely garble the graphics driver on openSuse Linux a couple of times (CUDA 2.2 SDK).

I’ll let the student intern compete against my version this week. ;) Anyone else is invited to compete as well, but please stick to Eratosthenes or only very slight alterations of this algorithm for fairness.

buchner@linux-dv5:~/NVIDIA_CUDA_SDK/projects/primes> ../../bin/linux/release/primes --max=200000000

RANGE is [0...200000000]

Total memory requirement for bit array is 24428 kB.

Computing on GPU...

Processing time (GPU): 4125.684082 (ms)

Computing on CPU...

Processing time (CPU): 4351.317871 (ms)

CPU says there are 11078937 primes from 0 to including 200000000.

GPU says there are 11078937 primes from 0 to including 200000000.

NOTE: My timings only cover the kernel run time; memory copies from and to the device are not included. Also the counting of the primes (bit population count) is currently done offline on the CPU and not included in the timings. Timings above are from a 9600M GT with 32 “CUDA processors”.
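For reference, the offline bit population count can be done a word at a time (a sketch; a real version would use a hardware popcount intrinsic, or __popc on the device):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Portable 32-bit population count (Kernighan's trick).
static unsigned popcount32(std::uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

// Count prime candidates left in a sieve bit array: a 0 bit means
// "never marked", i.e. prime. Counts zero bits among the first nbits.
std::size_t count_unmarked(const std::vector<std::uint32_t>& words, std::size_t nbits) {
    std::size_t zeros = 0, full = nbits / 32;
    for (std::size_t w = 0; w < full; ++w)
        zeros += 32 - popcount32(words[w]);
    for (std::size_t i = full * 32; i < nbits; ++i)   // leftover tail bits
        zeros += ((words[i / 32] >> (i % 32)) & 1u) == 0;
    return zeros;
}
```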

I will post my source code a little later.

Christian

Great, the prize is an iPhone?

Just joking…

Hey, could you run that test code I posted on your machine? I’m curious how a 1.1 device will behave.

I only have a really old Sony Ericsson T230 that I could offer, hehe.

buchner@linux-dv5:~> ./a.out 10000000

Testing Sieve tables up to N=10000000

Atomic Set found 664579 in 4640.85 ms.

Atomic Or found 664579 in 3177.31 ms.

Write Word found 664579 in 1111.9 ms.

Write Byte found 664579 in 1024.86 ms.

Write Bit found 664751 in 3905.22 ms.

Read/Write Bit found 664777 in 3896.32 ms.

Boy, these Atomics suck on Compute 1.1

Christian

Wow! That is indeed interesting. Well, now we know that atomics were improved in G200.

The interesting thing is that on your 1.1 device, the atomicOr was faster than atomicExch!

The practical fact you just learned: on both 1.1 and 1.3 devices, byte-wise addressing gives you the best performance, and the results above confirm that single-byte writes won’t interfere. It uses 8 times as much memory, but for this application that doesn’t matter for N<500M or so.

To you architecture guys:
Two minor puzzles from the testing above. They’re not too important but they are curious.

  1. Why would 1.1 hardware have a slower atomicExch than atomicOr? I would expect them to be the same, or the atomicOr be about 1/2 speed like 1.3 devices.

  2. Why would globally writing a single byte be 10% faster than writing a single aligned word on both 1.1 and 1.3 hardware? I’d expect them to be the same, and if there was an asymmetry, I’d expect the word to be faster.

Just a little update: By copying the initially computed primes from the first block into constant memory (with a really fast device to device copy), I was able to get a speed-up of roughly 2 on the GPU side. So now the midrange laptop GPU clearly outperforms the single threaded CPU version sieving up to half a billion integers (using the same basic algorithm). Above that, the watchdog timer says Hi

Of course I stand no chance against primegen 0.9.7, which uses the Sieve of Atkin ;)

EDIT: Now getting all primes up to 1 billion (1E9) in 4.6 seconds on an NVIDIA 8800GT (112 CUDA cores). Not too shabby but not a world record either.

Christian

Has anyone seen this:

http://www.unitedti.org/index.php?showtopic=8888

It’s only tangentially related to this thread, but it made me wonder if anyone has implemented any integer factorization code on the GPU, and if so, what kind of results you got. I bet that some good CUDA-based factorization code could crack that key in less than 2 days on a multi-GPU rig (like a 2xGTX295 or better).

Thanks for the link!

I think that is the only time I’ve ever seen an actual useful (not theoretical) result from a giant factoring project!

Now the question of factoring on the GPU is a great one.

It clearly can be done, and pretty efficiently, but there are a lot of steps. In particular there are three big phases to factoring using a modern algorithm, all of which are quite different problems, but also all of them could be done on GPUs.

With effort, of course.

The sieving step is the biggest one… and that uses some fancy sieves which are a lot different from the simple sieve we’re talking about in this thread. I don’t know enough to understand exactly how it works. (I do understand the quadratic sieve (the circa-1980 algorithm), but not the general number field sieve, which is the modern algorithm.)

A great book for understanding all these techniques (in quite readable text, a real feat!), coincidentally sitting here next to me now, is Prime Numbers: A Computational Perspective.

I wonder if we could contribute a kernel to this BOINC project that they set up:

http://boinc.unsads.com/rsals/

Apparently, the people on that forum want to try to recover the keys from some of the other models as well.

I’m not really sure what the details are with BOINC programming (I’ve heard of it, but never used it before), but I might give it a whirl this weekend and see what I can get done.

I am bringing in the cavalry to hijack my thread back.

Here’s some code to sieve primes on the GPU, in Eratosthenes flavour. It uses bitwise marking in shared memory to sieve. Memory races are avoided by large strides of 17 bits or more, which keeps several threads from hitting the same byte. One thread block sieves exactly 120120 integers, which fit into 15015 bytes of shared memory.

The very first thread block runs “solo” on the GPU and supplies the primes for sieving on a (possibly huge) grid of following blocks. The primes determined by the first block are copied to constant memory for better speed.

Sieving all integers up to 1 billion (1E9) on an 8800GT is possible in 4.6 seconds. Mind the watchdog timer on lesser machines.

Example of use (slightly abridged output):

buchner@linux-dv5:~/NVIDIA_CUDA_SDK/projects/primes> ../../bin/linux/release/primes --min=0 --max=40000000

Computing on GPU...

Processing time (GPU): 200.789993 (ms)

Computing on CPU...

Processing time (CPU): 640.202026 (ms)

CPU says there are 2433654 primes from 0 to including 40000000.

GPU says there are 2433654 primes from 0 to including 40000000.

Based on an SDK makefile, so place everything in a subfolder of your “projects” dir. Tested on CUDA SDK V 2.2

############################################################
#
# Build script for project
#
############################################################

# Add source files here
EXECUTABLE	:= primes
# CUDA source files (compiled with cudacc)
CUFILES		:= primes.cu
# CUDA dependency files
CU_DEPS		:= primes_kernel.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES		:= primes_gold.cpp

############################################################
# Rules and targets
include ../../common/common.mk

Host side code:

/**
 * Sieve of Eratosthenes on CPU and GPU.
 * originally based on Template project.
 * Host code.
 */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil_inline.h>

// includes, kernels
#include <primes_kernel.cu>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

extern "C"
void computePrimes( unsigned char* bitbuffer, unsigned long long bitbufsize, unsigned long long N );

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
	runTest( argc, argv);
	cutilExit(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run prime sieve on CUDA and CPU
////////////////////////////////////////////////////////////////////////////////
void
runTest( int argc, char** argv)
{
	unsigned int N = 120119;			 // largest number N in first block
	unsigned int NUM = N+1;			  // number of integers in range [0...N]
	unsigned int bitbufsize = (NUM+7)/8; // number of bytes required to hold bit pattern per block
	unsigned char *bitbuffer_gold = NULL;
	int min, max;

	// use command-line specified CUDA device, otherwise use device with highest Gflops/s
	if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
		cutilDeviceInit(argc, argv);
	else
		cudaSetDevice( cutGetMaxGflopsDeviceId() );

	// TODO: possibly extend this to support 64 bit integer arguments
	if (!cutGetCmdLineArgumenti( argc, (const char**) argv, "min", &min))
		min = 0;
	if (!cutGetCmdLineArgumenti( argc, (const char**) argv, "max", &max))
	{
		max = 100000000;
		fprintf(stderr, "Use --max option to set upper bound.\n");
	}

	unsigned long long RANGEMIN = min;
	unsigned long long RANGEMAX = max;
	unsigned int BLOCKS = ((RANGEMAX+1) + (NUM-1)) / NUM;
	fprintf(stderr, "RANGE is [%lld...%lld]\n", RANGEMIN, RANGEMAX);
	fprintf(stderr, "%d BLOCKS are needed to cover that range.\n", BLOCKS);
	fprintf(stderr, "Total memory requirement for bit array is %lld kB.\n", BLOCKS*bitbufsize / 1024);

	fprintf(stderr, "Allocating and clearing host memory.\n");
	bitbuffer_gold = (unsigned char*)calloc(1, BLOCKS*bitbufsize);

	unsigned int timer = 0;
	unsigned int timer2 = 0;

	fprintf(stderr, "Allocating and clearing device memory.\n");
	// allocate and clear device memory for result
	unsigned char* d_bitbuffer = NULL;
	cutilSafeCall( cudaMalloc( (void**) &d_bitbuffer, BLOCKS*bitbufsize));
	cutilSafeCall( cudaMemset( d_bitbuffer, 0, BLOCKS*bitbufsize ) );

	fprintf(stderr, "Precomputing repeating bit pattern.\n");
	// allocate memory for the precomputed pattern on host side
	unsigned char* h_prepattern = (unsigned char*) calloc(1, bitbufsize);
	unsigned int *h_iprepattern = (unsigned int*) h_prepattern;
	// mark out ALL multiples of given primes in precomputed pattern
	// We could also store this in a table of size 15015 bytes, saving the need for computation.
	unsigned int primes[] = {2,3,5,7,11,13};
	for (int j=0; j<sizeof(primes)/sizeof(int); j++)
		for (unsigned long long i = 0; i <= N; i += primes[j])
			h_iprepattern[i/32] |= 1<<(i&31);

	// allocate device memory for precomputed pattern and copy pattern to device
	unsigned char* d_prepattern = NULL;
	cutilSafeCall( cudaMalloc( (void**) &d_prepattern, bitbufsize));
	cutilSafeCall( cudaMemcpy( d_prepattern, h_prepattern, bitbufsize, cudaMemcpyHostToDevice ));

	// setup execution parameters
	dim3  gridsize( 1, 1, 1);
	dim3  blocksize( 32*6, 1, 1);
	dim3  gridsize2( BLOCKS-1, 1, 1);
	dim3  blocksize2( 32*6, 1, 1);

	fprintf(stderr, "Cooking kernels...\n");
	// cook the first kernel
	primesSMKernel<<< gridsize, blocksize, bitbufsize >>>( d_prepattern, d_bitbuffer, bitbufsize, 0 );
	cudaThreadSynchronize();
	if (BLOCKS > 1)
	{
		// cook the second kernel
		primesSMKernel2<<< gridsize2, blocksize2, bitbufsize >>>( d_prepattern, d_bitbuffer, bitbufsize, 0 );
		cudaThreadSynchronize();
	}

	fprintf(stderr, "Computing on GPU...\n");
	cutilCheckError( cutCreateTimer( &timer));
	cutilCheckError( cutStartTimer( timer));
	// execute the first kernel
	primesSMKernel<<< gridsize, blocksize, bitbufsize >>>( d_prepattern, d_bitbuffer, bitbufsize, NUM );
	cudaThreadSynchronize();
	if (BLOCKS > 1)
	{
		// copy the precomputed prime bitarray into constant memory (for better caching)
		cudaMemcpyToSymbol( d_prebitbuffer, d_bitbuffer, bitbufsize, 0, cudaMemcpyDeviceToDevice );
		// execute the second kernel
		primesSMKernel2<<< gridsize2, blocksize2, bitbufsize >>>( d_prepattern, d_bitbuffer, bitbufsize, NUM );
		cudaThreadSynchronize();
	}
	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");
	cutilCheckError( cutStopTimer( timer));
	// check if kernel execution generated an error
	cutilCheckMsg("Kernel execution failed");
	fprintf(stderr, "Processing time (GPU): %f (ms)\n", cutGetTimerValue( timer));
	cutilCheckError( cutDeleteTimer( timer));

	// allocate mem for the result bitarray on host side
	unsigned char* h_bitbuffer = (unsigned char*) malloc(BLOCKS * bitbufsize);
	// copy result bitarray to host side
	cutilSafeCall( cudaMemcpy( h_bitbuffer, d_bitbuffer, BLOCKS * bitbufsize, cudaMemcpyDeviceToHost ));

	fprintf(stderr, "Computing on CPU...\n");
	cutilCheckError( cutCreateTimer( &timer2));
	cutilCheckError( cutStartTimer( timer2));
	// compute reference solution
	computePrimes( bitbuffer_gold, BLOCKS*bitbufsize, BLOCKS*NUM-1 );
	cutilCheckError( cutStopTimer( timer2));
	fprintf(stderr, "Processing time (CPU): %f (ms)\n", cutGetTimerValue( timer2));
	cutilCheckError( cutDeleteTimer( timer2));

	// count primes on CPU side
	unsigned long long count = 0;
	for (unsigned long long i=RANGEMIN; i <= RANGEMAX; i++)
	{
		if ((bitbuffer_gold[i/8] & (1<<(i&7))) == 0)
			count++;
	}
	fprintf(stderr, "CPU says there are %lld primes from %lld to including %lld.\n", count, RANGEMIN, RANGEMAX);

	// count primes on GPU side
	count = 0;
	unsigned int *h_ibitbuffer = (unsigned int*)h_bitbuffer;
	for (unsigned long long i=RANGEMIN; i <= RANGEMAX; i++)
	{
		if ((h_ibitbuffer[i/32] & (1<<(i&31))) == 0)
			count++;
	}
	fprintf(stderr, "GPU says there are %lld primes from %lld to including %lld.\n", count, RANGEMIN, RANGEMAX);

	// compare the results bit by bit and warn on mismatches
	for (unsigned long long i=0; i <= BLOCKS*NUM; i++)
	{
		bool cpu = (bitbuffer_gold[i/8] & (1<<(i&7))) == 0;
		bool gpu = (h_ibitbuffer[i/32] & (1<<(i&31))) == 0;
		if (cpu != gpu) fprintf(stderr, "WARNING: CPU and GPU disagree on primality of %lld (CPU: %d, GPU: %d)\n", i, cpu, gpu);
	}

	free(bitbuffer_gold);
	free(h_bitbuffer);
	cutilSafeCall(cudaFree(d_bitbuffer));
	cudaThreadExit();
}

“Gold” kernel:

/**
 * Sieve of Eratosthenes.
 * Host Code
 */

#include <math.h>
#include <stdio.h>

////////////////////////////////////////////////////////////////////////////////
// export C interface
extern "C"
void computePrimes( unsigned char* bitbuffer, unsigned long long bitbufsize, unsigned long long N );

////////////////////////////////////////////////////////////////////////////////
//! Compute reference prime set
////////////////////////////////////////////////////////////////////////////////
void
computePrimes( unsigned char* bitbuffer, unsigned long long bitbufsize, unsigned long long N )
{
	unsigned long long firstprime = 2;
	unsigned long long sqrt_N = ceil(sqrt(N));

	bitbuffer[0] |= 3;  // 0 and 1 are not primes.

	for (;;)
	{
		// mark out multiples of primes
		for (unsigned long long i = firstprime * firstprime; i <= N; i += firstprime)
			bitbuffer[i/8] |= 1<<(i&7);
		// search for next prime
		for (firstprime = firstprime + 1; firstprime < N; firstprime++)
			if ((bitbuffer[firstprime/8] & (1<<(firstprime&7))) == 0) break;

		if (firstprime > sqrt_N) break;
	}
}

GPU code, two kernels

/* Sieve of Eratosthenes

 * Device Code

 */

#ifndef _PRIMES_KERNEL_H_

#define _PRIMES_KERNEL_H_

#include <stdio.h>

#define SDATA( index)	  cutilBankChecker(sdata, index)

#ifdef __DEVICE_EMULATION__

#define EMUSYNC __syncthreads()

#else

#define EMUSYNC

#endif

#define __register__ volatile

////////////////////////////////////////////////////////////////////////////////

//! Sieve of Eratosthenes on GPU

////////////////////////////////////////////////////////////////////////////////

__global__ void

primesSMKernel( unsigned char *g_prepattern, unsigned char *g_bitbuffer, unsigned int bitbufsize, unsigned int NUM )

{

	extern unsigned char __shared__ sdata[];

	if (NUM > 0)

	{

		__register__ unsigned int *isdata = (unsigned int*)&sdata[0];

		__register__ unsigned int *g_ibitbuffer	= (unsigned int*)g_bitbuffer;

		__register__ unsigned int *g_iprepattern   = (unsigned int*)g_prepattern;

		__register__ unsigned int num	= bitbufsize / 4;  // round down

		__register__ unsigned int remain = bitbufsize % 4;  // remainder

		__register__ const unsigned int idx = threadIdx.x;

		// initialize shared memory with precomputed bit pattern for primes 2, 3, 5, 7, 11, 13

		for (__register__ int i=0; i < num; i+= blockDim.x)

			if (i+idx < num) isdata[i+idx] = g_iprepattern[i+idx];

		if (idx < remain) sdata[4*num+idx] = g_prepattern[4*num+idx];

		__syncthreads();

		unsigned int __shared__ firstprime;

		__register__ unsigned int sqrt_N = ceil(sqrtf((float)NUM));

	

		if (threadIdx.x == 0)

		{

			firstprime = 17;  // start marking multiples of primes beginning with prime 11

			sdata[0] = 0x53;  // 2 is prime, 3 is prime, 5 is prime, 7 is prime, the rest in this byte isn't

			sdata[1] = 0xd7;  // 11 is prime, 13 is prime

		}

		__syncthreads();

		while (firstprime <= sqrt_N)

		{

			// mark out all multiples of "firstprime" starting with firstprime squared.

			for (unsigned int i = (firstprime+idx) * firstprime; i < NUM; i += firstprime*blockDim.x)

				sdata[i>>3] |= (1<<(i&7));

			__syncthreads();

			// search for next prime (unmarked number) in the bit array using a single thread

			if (threadIdx.x == 0)

				for (firstprime = firstprime + 1; firstprime < NUM; firstprime++)

					if ((sdata[firstprime>>3] & (1<<(firstprime&7))) == 0) break;

			__syncthreads();

		}

		// coalesced and bank-conflict free 32 bit integer copy from shared to global

		for (__register__ int i=0; i < num; i+= blockDim.x)

			if (i+idx < num) g_ibitbuffer[i+idx] = isdata[i+idx];

		// copy remaining bytes

		if (idx < remain) g_bitbuffer[4*num+idx] = sdata[4*num+idx];

	}

}

__device__ __constant__ unsigned char d_prebitbuffer[65536];

__global__ void
primesSMKernel2( unsigned char *g_prepattern, unsigned char *g_bitbuffer, unsigned int bitbufsize, unsigned int NUM )
{
	extern unsigned char __shared__ sdata[];

	if (NUM > 0)
	{
		__register__ unsigned int *isdata = (unsigned int*)&sdata[0];
		__register__ unsigned int *g_ibitbuffer   = (unsigned int*)g_bitbuffer;
		__register__ unsigned int *g_iprepattern  = (unsigned int*)g_prepattern;
		__register__ unsigned int num    = bitbufsize / 4;  // round down
		__register__ unsigned int remain = bitbufsize % 4;  // remainder
		__register__ const unsigned int idx = threadIdx.x;

		// initialize shared memory with precomputed bit pattern for primes 2, 3, 5, 7, 11, 13
		for (__register__ int i=0; i < num; i+= blockDim.x)
			if (i+idx < num) isdata[i+idx] = g_iprepattern[i+idx];
		if (idx < remain) sdata[4*num+idx] = g_prepattern[4*num+idx];

		// K is the block-specific offset; promote to 64 bit before multiplying
		unsigned long long K = (unsigned long long)NUM * blockIdx.x + NUM;
		__syncthreads();

		unsigned int __shared__ firstprime;
		__register__ unsigned int sqrt_KN = ceil(sqrtf((float)(K+NUM)));

		if (threadIdx.x == 0)
		{
			firstprime = 17;  // start marking multiples of primes beginning with prime 17
		}
		__syncthreads();

		while (firstprime <= sqrt_KN)
		{
			// compute an offset such that we're instantly entering the range of
			// K...K+N in the loop below.
			// Because 64 bit division is costly, use only the first thread
			unsigned int __shared__ offset;
			if (threadIdx.x == 0)
			{
				offset = 0;
				// compute firstprime^2 in 64 bit to avoid 32 bit overflow
				unsigned long long p2 = (unsigned long long)firstprime * firstprime;
				if (K >= p2) offset = (K - p2) / firstprime;
			}
			__syncthreads();

			// mark out all multiples of "firstprime" that fall into this thread block, starting
			// with firstprime squared.
			for (unsigned long long i = ((unsigned long long)offset+firstprime+idx) * firstprime;
				 i < K+NUM;
				 i += firstprime*blockDim.x)
				if (i >= K) sdata[(i-K)>>3] |= (1<<((i-K)&7));
			__syncthreads();

			// search for next prime (unmarked number) in the reference bit array using a single thread
			if (threadIdx.x == 0)
				for (firstprime = firstprime + 1; firstprime < NUM; firstprime++)
					if ((d_prebitbuffer[firstprime>>3] & (1<<(firstprime&7))) == 0) break;
			__syncthreads();
		}

		// byte-by-byte uncoalesced and bank-conflict-prone copy from shared to global memory
		// TODO: create a generic coalesced copy routine that works with arbitrary
		//       output byte offsets on compute 1.0 and 1.1 hardware
		__register__ unsigned int byteoff = bitbufsize * blockIdx.x + bitbufsize;
		for (__register__ int i=0; i < bitbufsize; i+= blockDim.x)
			if (i+idx < bitbufsize) g_bitbuffer[byteoff+i+idx] = sdata[i+idx];
	}
}

#endif // #ifndef _PRIMES_KERNEL_H_
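The offset trick in the second kernel can be checked against a hedged host-side C sketch of the same segmented idea (all names here are illustrative): sieve a base range up to sqrt(K+NUM), then start each prime's marking at the first multiple that is at least max(K, p*p). The sketch rounds up to that multiple directly instead of relying on the kernel's `i >= K` guard, but the set of marked composites is the same:

```c
#include <stdlib.h>

/* Count primes in [K, K+NUM) with a segmented sieve.
   base[] sieves the small range up to sqrt(K+NUM); seg[i] stands for
   the number K+i and is nonzero when that number is composite. */
static unsigned long long count_segment(unsigned long long K, unsigned long long NUM)
{
    unsigned long long hi  = K + NUM;
    unsigned long long lim = 2;
    while (lim * lim < hi) lim++;          /* lim = ceil(sqrt(hi)) */

    char *base = calloc(lim + 1, 1);        /* base[i] != 0 => composite */
    char *seg  = calloc(NUM, 1);

    for (unsigned long long p = 2; p <= lim; p++) {
        if (base[p]) continue;              /* p is composite, skip it */
        for (unsigned long long m = p * p; m <= lim; m += p) base[m] = 1;
        /* first multiple of p that is >= max(K, p*p); for K > p*p this is
           p*p + offset*p rounded up, with offset = (K - p*p) / p */
        unsigned long long start = p * p;
        if (start < K) start = ((K + p - 1) / p) * p;
        for (unsigned long long m = start; m < hi; m += p) seg[m - K] = 1;
    }
    unsigned long long count = 0;
    for (unsigned long long i = 0; i < NUM; i++) {
        unsigned long long n = K + i;
        if (n >= 2 && !seg[i]) count++;
    }
    free(base); free(seg);
    return count;
}
```

Comparing such a host count against the kernel's output per block is a cheap way to catch marking bugs before trusting the GPU totals.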

Christian

Now that is an interesting strategy to avoid races. This is obviously a common worry.

Are your prime counts always accurate? I’ve found Wolfram|Alpha to be a great, great help for checking with an independent source.

I love the idea of using the fact that your strides are large to guarantee that the threads in a warp won’t step on each other. Since those threads advance in lockstep, they won’t interfere.

But what about threads from two different warps? Say thread 0 wants to mark bit 800, and thread 32 wants to mark bit 801. Both threads load the shared[100] byte. One thread masks it with 1, one thread masks it with 2. The byte is rewritten back to shared memory. But the (potential) race is if thread 32 reads after thread 0 reads, but before thread 0 writes. It's a narrow window of opportunity, but still a potential race failure.

edit later: Ah, after actually looking at the code, I understand. All the threads are using the SAME prime, and you have them spaced out. So they can't collide. I was expecting every thread to have its own prime. That's more efficient (especially as the primes grow in size and there are fewer bits to set) but it's also where the write collisions potentially occur.
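For that one-prime-per-thread variant, one hedged sketch of how the collisions could be handled on hardware with atomic support: let each thread own a prime and set bits with atomicOr on 32-bit words, so the read-modify-write can't lose updates. The kernel name and the `primes` array are illustrative, not from the posted code:

```cuda
// Hypothetical sketch (untested): one prime per thread; atomicOr makes
// concurrent bit sets on the same 32 bit word safe, at some cost in
// serialization when words are contended. Requires atomic support.
__global__ void markOwnPrime(unsigned int *bits, const unsigned int *primes,
                             unsigned int nprimes, unsigned int NUM)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nprimes) return;
    unsigned int p = primes[t];
    // mark multiples of p starting at p*p, one bit per composite
    for (unsigned long long i = (unsigned long long)p * p; i < NUM; i += p)
        atomicOr(&bits[i >> 5], 1u << (unsigned int)(i & 31));
}
```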

Thanks much for posting the code! That kind of sharing is how we all learn (and play).

A zip file attachment is usually easier since cut-and-pasting multiple files is awkward and you have to match up filenames, but that’s not a big deal obviously.

You of course have now made me want to make my own version! A simple goal, fun algorithmic issues, easy to measure success and speed… it’s too tempting to take an afternoon playing. We’ll see.

In theory I’m supposed to be doing work…

Thanks, Christian!

Wow Christian, that looks great. Maybe nVidia could include this as an SDK project in the next release?

Sorry for hijacking your thread though ;)

Hi,

Not that I’m a math guy… but something I noticed in the second kernel:

....

		   // search for next prime (unmarked number) in the reference bit array using a single thread

			if (threadIdx.x == 0)

				for (firstprime = firstprime + 1; firstprime < NUM; firstprime++)

					if ((d_prebitbuffer[firstprime>>3] & (1<<(firstprime&7))) == 0) break;

	 ....

If this loop takes time and NUM - firstprime + 1 is big, you might benefit from doing a scan and reduction using multiple threads instead of running this loop in a single thread.

my 0.0001 cent ;)

eyal
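One hedged sketch of eyal's suggestion, as it might look inside the kernel: each thread scans a strided stripe of the reference bit array for its smallest unmarked number, then a shared-memory min-reduction picks the block-wide smallest. The `cand` array and the 256-thread, power-of-two assumption are illustrative:

```cuda
// Hypothetical replacement (untested) for the single-thread search,
// assuming blockDim.x == 256 and a power of two.
__shared__ unsigned int cand[256];
unsigned int best = 0xffffffff;
// each thread checks firstprime+1+tid, firstprime+1+tid+256, ...
for (unsigned int n = firstprime + 1 + threadIdx.x; n < NUM; n += blockDim.x)
    if ((d_prebitbuffer[n >> 3] & (1 << (n & 7))) == 0) { best = n; break; }
cand[threadIdx.x] = best;
__syncthreads();
// tree reduction: the minimum over all stripes is the next unmarked number
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        cand[threadIdx.x] = min(cand[threadIdx.x], cand[threadIdx.x + s]);
    __syncthreads();
}
if (threadIdx.x == 0) firstprime = cand[0];
__syncthreads();
```

Whether this wins depends on how far apart consecutive primes are; for small gaps the single-thread scan exits quickly and the reduction's sync overhead may dominate.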