Sieve of Eratosthenes

I tried a reduction over the next blockDim.x bits (taking into account only the bits still marked as prime). This was actually slower.

On average it takes more instructions (and scheduled warps) to complete than just scanning the next few bits with a single thread.
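A rough sketch of the two approaches, for illustration only (not Christian’s actual code; the function names and the 256-entry scratch array are my assumptions):

// Approach 1: a single thread scans forward a few bits.
__device__ int next_prime_scan(const unsigned int *sieve, int pos, int limit)
{
    for (int i = pos + 1; i < limit; ++i)
        if ((sieve[i >> 5] & (1u << (i & 31))) == 0)   // clear bit = still prime
            return i;
    return -1;
}

// Approach 2: blockDim.x threads each test one bit, then a shared-memory
// min-reduction picks the lowest hit. More instructions and more scheduled
// warps per lookup, which matches the observation above.
__device__ int next_prime_reduce(const unsigned int *sieve, int pos, int limit)
{
    __shared__ int cand[256];                          // assumes blockDim.x <= 256, power of two
    const int NONE = 0x7FFFFFFF;
    int i = pos + 1 + threadIdx.x;
    bool hit = (i < limit) && ((sieve[i >> 5] & (1u << (i & 31))) == 0);
    cand[threadIdx.x] = hit ? i : NONE;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // min-reduction
        if (threadIdx.x < s)
            cand[threadIdx.x] = min(cand[threadIdx.x], cand[threadIdx.x + s]);
        __syncthreads();
    }
    return (cand[0] == NONE) ? -1 : cand[0];
}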

Sad but true ;)

Christian

It isn’t that great after all. My student built something on SPWorley’s byte-wise write marking concept (in global memory). For my code to be really useful I would have to split it up into several shorter kernel runs to avoid the watchdog timer and to reach regions above 2^32. And it would require counting the primes on the GPU (a bit population count followed by a reduction kernel).
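A minimal sketch of what such a GPU-side count could look like, assuming a set bit marks a composite and 256 threads per block (handling of the last partial word is omitted):

__global__ void count_clear_bits(const unsigned int *sieve, int nwords,
                                 unsigned int *total)
{
    __shared__ unsigned int partial[256];              // assumes 256 threads per block
    unsigned int count = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nwords;
         i += gridDim.x * blockDim.x)
        count += __popc(~sieve[i]);                    // clear bits = primes
    partial[threadIdx.x] = count;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // block-level reduction
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);                  // accumulate across blocks
}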

His first test runs seem to beat me by a factor of 2, but his code produces incorrect results in some cases. So I am not yet accepting defeat.

Christian

I verified against primegen, which I modified to output just the total count instead of a list of all primes. My test runs so far only went up to 1E9, which requires ~128MB of contiguous memory for the bit pattern.

Sadly, primegen 0.9.7 computes all primes up to 1 billion in 2 seconds on the CPU (after commenting out the writing of digits to stdout), making it more than twice as fast as my code on the 8800 GT GPU (112 shaders).

Christian

For those who are interested in a second perspective on Eratosthenes, here’s my intern’s code using byte-wise masking in global memory and a slightly different approach to distributing the workload among thread blocks. Each block processes a different subset of the precomputed primes.

Attached as a .tar.gz archive this time.

His version even accepts scientific notation for the search range on the command line. We’ve tested up to 400 million because his memory requirements are higher. Example of use:

countprimes 0 1e8
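For readers without the archive, here is a rough sketch of the byte-wise marking idea, with each block handling a different subset of the precomputed primes — my own reconstruction, not the attached code:

// Sketch: byte-per-number marking in global memory. Block b handles primes
// b, b + gridDim.x, b + 2*gridDim.x, ... from the precomputed list; within
// a block, the threads stride over the multiples of the current prime.
__global__ void mark_composites(unsigned char *flags, unsigned long long n,
                                const unsigned int *primes, int nprimes)
{
    for (int p = blockIdx.x; p < nprimes; p += gridDim.x) {
        unsigned long long prime = primes[p];
        unsigned long long start = prime * prime;      // smaller multiples are handled by smaller primes
        for (unsigned long long m = start + (unsigned long long)threadIdx.x * prime;
             m < n; m += (unsigned long long)blockDim.x * prime)
            flags[m] = 1;                              // byte write: no race with neighbor bytes
    }
}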

Surprisingly, his “gold” CPU kernel produced wacko results, but the GPU-based counts are correct. So disregard the CPU output.

And his version is faster than mine (for large ranges only!), so my use of shared memory was not providing the advantage I had hoped for.

So now, if you’ll excuse us - we’re going to do 4x4 MIMO radio channel simulations for a Rayleigh fading channel in the Long Term Evolution (LTE) radio system on the GPU ;)

I hate nagging questions, so I finally went back to this problem to see what was up with write speeds especially with Christian’s shared memory variant.

In particular, my previous experiment in this thread measured various write strategies for global memory, but we’re all expecting shared memory to be faster and perhaps have different memory write sizes (does writing a byte affect neighbor bytes in a race?)

So I repeated my tests using shared memory writes. I compared all the methods for shared memory writing, and then used the same test set to measure global memory methods (with one block running, so minimal bandwidth issues), and finally “busy global”, where many blocks are writing at once so bandwidth matters.

Here clock() counts are useful and keep measurements quite accurate and repeatable. [Thanks for including that, NV guys!]
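For reference, a sketch of how clock() can bracket the timed region inside a kernel (an assumption about the setup, not the actual test harness):

__global__ void timed_write_test(unsigned int *buf, unsigned int *cycles)
{
    clock_t start = clock();                           // SM cycle counter at entry
    for (int i = threadIdx.x; i < 4096; i += blockDim.x)
        buf[i] = i;                                    // the write strategy under test
    __syncthreads();
    if (threadIdx.x == 0)
        *cycles = (unsigned int)(clock() - start);     // elapsed clocks for this block
}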

The transaction results for shared memory show that just as with global memory, writing one byte is always safe even in the presence of other simultaneous writes from other threads to the neighbor bytes. Writing bits is unsafe, which is expected and again just like global memory.

Timing wise, we see some pleasant surprises, mostly in the quite nice speed of global writes when compared to shared writes.

Below is timing for a sieve using different write strategies. The absolute times are not important, just the relative speeds between different methods.

Method          Shared     Global     BusyGlobal
                (KClocks)  (KClocks)  (KClocks)
Atomic Set         682        303        4256
Atomic Or          637        863       13845
Write Word         204        302         409
Write Byte         187        271         366
Write Bit          320       1287        2407
R/W Bit            320       1291        2405
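The write strategies in the table correspond roughly to the following forms (my own sketch, not the benchmark source; the R/W Bit variant, which presumably checks the word before writing, is left out):

// 'words' / 'bytes' may live in shared or global memory depending on the column.
__device__ void mark_atomic_set(unsigned int *words, unsigned int i)
{ atomicExch(&words[i >> 5], 0xFFFFFFFFu); }           // Atomic Set: atomic word exchange

__device__ void mark_atomic_or(unsigned int *words, unsigned int i)
{ atomicOr(&words[i >> 5], 1u << (i & 31)); }          // Atomic Or: atomic single-bit set

__device__ void mark_word(unsigned int *words, unsigned int i)
{ words[i >> 5] = 0xFFFFFFFFu; }                       // Write Word: plain 32-bit store

__device__ void mark_byte(unsigned char *bytes, unsigned int i)
{ bytes[i] = 1; }                                      // Write Byte: plain 8-bit store (race-safe)

__device__ void mark_bit(unsigned int *words, unsigned int i)
{ words[i >> 5] |= 1u << (i & 31); }                   // Write Bit: read-mask-write (races)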

Some notes:

Shared memory atomics are cheap, only about 3X the cost of a normal shared write. This is in a test that has constant contention (every thread is writing as fast as it can), but the writes are spread out effectively randomly over 4K addresses. I’m sure speeds would be lower if they were all pounding just a few addresses.

Unlike global atomics, it’s slightly cheaper to use a shared atomic-or than a shared atomic exchange.

Writing one bit (as expected) causes races and is slower, since it’s implemented as a read-mask-write set of operations.

It is surprising but writing to global memory is only about 1.5X slower than writing to shared memory. When we think of “shared memory is fast!” we’re thinking of bandwidth. But the actual operation throughput is pretty comparable if you’re not at the bandwidth limit. I really am impressed by the NV engineers on this one. In my head I was expecting a 10X difference or something. Tim Murray once said that the global writes are fire-and-forget and therefore have high throughput. This test shows it’s completely true.

When the bus is not busy, a global atomic exchange has identical throughput to a normal global write! The speed of both normal writes and atomics collapses when the global memory bus is busy and you have a lot of SMs all writing at once, as you can see in the last column. And as you might expect, global atomics are more sensitive to the congestion. So the rough guideline of “go ahead and use atomics sparsely” is well supported.

How do these results affect something like a sieve with its sparse writes? Should you still do some pre-staging in shared memory? The answer is likely yes, it will still be faster, but not by an order of magnitude. The other takeaway is that even high-contention shared atomics are pretty efficient. I always stayed away from them since doing reductions or compactions with the traditional parallel methods is so ingrained, but maybe I’ll use them more now. Shared atomics are compute 1.2+ only, however, so that lower portability is probably the big negative to them.

Thank you for these insights.

EDIT: let me add here that I found that the stride (within the same half-warp) at which atomicOr operations write to global memory greatly affects the resulting speed. A bigger stride was enforced by swapping the high and low nibbles (4 bits each) of the threadIdx.x variable before assigning each thread its position for marking a multiple of a prime number (a total of 256 threads were used). This effectively increased the stride to 16 within each half-warp. Execution speed tripled.
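The swizzle can be reconstructed roughly like this (my sketch, not the original code):

// With 256 threads, exchanging the two 4-bit halves of threadIdx.x spreads
// neighboring threads of a half-warp 16 positions apart.
unsigned int tid      = threadIdx.x;                    // 0..255
unsigned int swizzled = ((tid & 0x0F) << 4) | (tid >> 4);
// thread 'swizzled' marks the corresponding multiple of the current prime,
// so consecutive threads in a half-warp write with a stride of 16.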

I’ve also noticed that my shared memory variant could be further optimized in several ways, maybe I’ll beat the speed of the global memory version that Aamir wrote.

I wonder what kind of problems will be posed in the upcoming Top Coder competition. I bet it’s going to be relatively simple jobs that will require a lot of low level optimization and tweaking to be a winner.

The sites hosting the prime factors for the TI calculators are now being served with DMCA takedown notices.

Hello, it’s two friggin’ prime numbers. And the factoring serves the purpose of interoperability - one exemption granted by the DMCA.

Trying to take down two large numbers will only result in a Streisand effect. Who is going to make the first T-shirt with the keys on it ;)

Christian

Yeah, I agree that atomic throughput isn’t a simple linear function of the contention/collision rate.

It’s very possible that it’s related to global memory segments: if I recall correctly there are 8 big segments for global memory writes, and much like shared memory bank conflicts, global accesses have their own chunks too. That’s barely mentioned anywhere online; I remember learning about it from a slide in a presentation.

That would explain why your address swizzle helped, it would tend to randomize which segment is written to.

I know this thread has been dormant a while, but after a quick search it still seems like the most relevant one (apparently there’s not a lot of work going on to optimize the sieve of Eratosthenes on a GPU). I’ve recently coded a sieve that is quite fast, so I thought I’d share it here.

It is a segmented sieve. Each thread block sieves a different sub-interval of the desired interval and then counts the primes in that sub-interval. (A final count over all of the sub-intervals occurs in host code.) Within each thread block, 256 threads simultaneously sieve 8192 bytes of shared memory. Each thread is responsible for sieving a small set of primes. Race conditions are avoided by eliminating read-modify-write operations and simply doing byte writes instead, except for the smallest primes, which unroll a pre-computed table over a bit-packed copy of the shared sieve block.

The speed comes from parallelization of the blocks due to segmentation of the sieve interval, parallelization of the threads into shared memory, use of a “mod 6” wheel, hand-unrolled inner loops, and the small-prime bit-packed optimization.
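To make the structure concrete, here is a rough sketch of the per-block segment sieve — not ben’s actual code; the segment size, the naive counting step, and the omission of the mod-6 wheel and small-prime table are all my simplifications:

#define SEGMENT_BYTES 8192                             // bytes of shared memory sieved per block

__global__ void sieve_segment(unsigned long long base, const unsigned int *primes,
                              int nprimes, unsigned int *block_counts)
{
    __shared__ unsigned char seg[SEGMENT_BYTES];
    unsigned long long lo = base + (unsigned long long)blockIdx.x * SEGMENT_BYTES;

    for (int i = threadIdx.x; i < SEGMENT_BYTES; i += blockDim.x)
        seg[i] = 1;                                    // 1 = assumed prime (0 and 1 not special-cased here)
    __syncthreads();

    // each thread sieves its own subset of the precomputed primes
    for (int p = threadIdx.x; p < nprimes; p += blockDim.x) {
        unsigned long long prime = primes[p];
        unsigned long long first = ((lo + prime - 1) / prime) * prime;  // first multiple >= lo
        if (first < prime * prime) first = prime * prime;
        for (unsigned long long m = first; m < lo + SEGMENT_BYTES; m += prime)
            seg[m - lo] = 0;                           // byte write, no read-modify-write race
    }
    __syncthreads();

    // count survivors in this segment (naive; a parallel reduction would be used in practice)
    if (threadIdx.x == 0) {
        unsigned int count = 0;
        for (int i = 0; i < SEGMENT_BYTES; ++i)
            count += seg[i];
        block_counts[blockIdx.x] = count;
    }
}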

Input intervals are limited to 10e9, but, for example, sieving a range from 0 to 1e9 takes 64 milliseconds using an M2050 Tesla card. I’d be interested to know how (or even if) the code runs on other hardware.

cheers,

  • ben.