Simple pi-algorithm


As an introduction project I was thinking of developing a pi-evaluator. The algorithm I’m intending to use is very simple, yet suitable for threaded execution.

For all threads:

  1. Generate random floating values between 0-1 for (X,Y)
  2. Determine whether the point is inside the circle-arc by evaluating if absolute distance is less or equal to one.
  3. Report the result. If the point was inside, increase a “hit_count” variable, and a “point_count” variable by one. If it was outside, just increase the “point_count”

Now, in the host program (outside the kernel) take the hit_count:point_count ratio (which should be ~0.78) and multiply this by 4 (as we are only doing this test for one quadrant) to get your approximation of PI. Ofcourse there are several numerical drawbacks with this approach, for instance the precision of the random numbers will limit the amount of correct decimals you will get, and more things.

However, what would be the best solution to do this in a practical sense. I have a few questions regarding this. Where should I generate the random numbers? If I do it in the kernel by say rand(), all threads get the value. Making a RNG in the kernel of some seed will be a bit too slow, right? How should I store the hits/points variables? All suggestions in general would be appriciated!

Many thanks!

This isn’t as simple an introductory project as you might think!

There is no rand() built-in function in CUDA. However, the MersenneTwister example in the SDK demonstrates how to generate random numbers in parallel - essentially you run multiple generators in parallel with different initial seeds.

It’s not possible to increment a global counter from parallel threads as you suggest without atomic operations, which are only available on devices with compute capability 1.1. Instead you could write out 1s where the points are inside the circle, and then perform a parallel sum reduction to get the final count.

Hope this helps.

Hi Simon.

Thanks for your response.

I was a little bit interested in the MarsenneTwister already, I’ll definately have a look at that. Regarding the global counter, I had already that solution in mind. Like use a grid(x,1,1) and increment the “local” slot (thid) by one as this operation wouldnt have to be atomic. I’m certain what a parallel sum reduction means? Would really appriciate if you could elaborate otherwise!

By Parallel Sum Reduction, pretty sure he means using the device to add up a bunch of numbers before copying them back to the CPU. On code I’ve written I have each thread increment a counter in shared memory then at the end add up all the counters and copy to global memory to be copied to host memory.

Look at the scan example in the SDK for information on parallel reductions.

Nathan, thanks a lot.

I understand then. Would you like to show me the lines that accomplish that? you need to synchronize for this to work?

Thanks, great tip!