Convert a C++ pi program to CUDA

Hi, how do you start converting a C++ program that calculates pi into a CUDA program? This is the C++ code:

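#include "stdafx.h"   // Visual Studio precompiled header; drop it on other toolchains
#include <stdio.h>

int main ()
{
  unsigned long i = 0;
  double j, pi = 2;

  // Wallis product: pi = 2 * (4/3) * (16/15) * (36/35) * ...
  while (1)
  {
    i += 2;
    j = (double) i * i;
    j /= j - 1;
    pi *= j;
    printf ("pi: %.20f\n", pi);
  }
}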

Hmm, not a good question to ask, since it’s unclear what your goal is.

The problem is that the code you posted would be easy to do in CUDA, pretty much unchanged, but pretty much pointless… it’s single threaded and not the most efficient way to compute Pi if that’s really your goal.

It looks like you just want some kind of “hello world” CUDA program, and you can start by looking at all the SDK sample programs.

I see, how can you use more threads for calculating pi?

Can you give an example?

You can do it very well, I think.

First compute your j for every thread:

float j = threadIdx.x + blockDim.x * blockIdx.x + 1;  // +1 so the first factor uses i = 2, as in the original loop
j *= 2;
j = j * j;
j = j / (j - 1);

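At this point you need to do a reduction (a vector collapse) over all the j values, using multiplication instead of addition. There is a very good reduction example in the SDK, and a simple implementation is easy to write.

Oh, and you need to decide whether to do this with doubles or floats, and of course make sure you have the supporting hardware. With floats, j*j - 1 rounds back to j*j once j*j exceeds 2^24 (j above 4096), so every later factor becomes exactly 1 and stops contributing.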

good luck
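A minimal sketch of the idea above, written as plain C++ on the CPU so it can be checked without a GPU. Each loop index `tid` stands in for one CUDA thread, and the pairwise loop mimics a tree reduction that multiplies instead of adds. The names `wallis_factor` and `wallis_pi` are made up for this illustration, and the power-of-two size is an assumption that keeps the reduction simple.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Factor that the thread with global index tid would compute.
// tid + 1 is used so the first factor is 4/3 (i = 2), as in the original loop.
double wallis_factor(std::size_t tid) {
    double j = 2.0 * static_cast<double>(tid + 1);
    j *= j;
    return j / (j - 1.0);
}

// Emulates n threads followed by a tree reduction with multiplication.
// Assumption: n is a power of two.
double wallis_pi(std::size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<double> f(n);
    for (std::size_t tid = 0; tid < n; ++tid)      // one factor per "thread"
        f[tid] = wallis_factor(tid);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t tid = 0; tid < stride; ++tid)
            f[tid] *= f[tid + stride];             // multiply instead of add
    return 2.0 * f[0];                             // the original starts from pi = 2
}
```

Calling `wallis_pi(1u << 20)` gives a value that agrees with pi to roughly six digits, which also shows how slowly this product converges.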

If you’re trying to write a CUDA program to calculate the value of pi, perhaps a better method would be to use a Monte Carlo simulation, since each thread could work independently of the others (I don’t think that applies to your method shown above).

Monte Carlo Example:

You could just run the simulation X times in each thread and, for each thread, maintain a count of the number of points x_n that fall inside the circle. Then do a little reduction to sum the number of “inside” points found in each thread, divide the sum by X * n (where n is the number of threads used), and multiply by 4, since the fraction of points that land inside approximates pi/4. The advantage here is that CUDA should allow you to run this simulation much faster than on a CPU since you’ll be able to run many threads simultaneously, which should also give you a more accurate result for a given amount of computation time.
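The scheme described above could look roughly like this. It is a CPU-side C++ sketch in which a loop over `tid` stands in for the CUDA threads; the function names, the seed value, and the use of `std::mt19937` are assumptions made for the illustration, not something from this thread.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Work of one simulated thread: draw `samples` points in the unit square
// and count how many fall inside the quarter circle of radius 1.
std::uint64_t count_inside(unsigned tid, std::uint64_t samples) {
    std::mt19937 gen(12345u + tid);  // hypothetical per-thread seed
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::uint64_t inside = 0;
    for (std::uint64_t s = 0; s < samples; ++s) {
        float x = dist(gen);
        float y = dist(gen);
        if (x * x + y * y < 1.0f) ++inside;
    }
    return inside;
}

// Runs every "thread", sums the per-thread counts, and scales to pi.
double monte_carlo_pi(unsigned n_threads, std::uint64_t samples_per_thread) {
    std::vector<std::uint64_t> inside(n_threads);
    for (unsigned tid = 0; tid < n_threads; ++tid)  // concurrent on a GPU
        inside[tid] = count_inside(tid, samples_per_thread);
    std::uint64_t total = 0;
    for (std::uint64_t c : inside) total += c;      // the sum reduction
    return 4.0 * static_cast<double>(total) /
           (static_cast<double>(n_threads) * static_cast<double>(samples_per_thread));
}
```

With 64 simulated threads and 10000 samples each, `monte_carlo_pi(64, 10000)` should land within a few hundredths of pi; the error shrinks only with the square root of the sample count.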

Actually, each thread does compute its own part of the solution, and then you need to collapse the results. I have done this a few times when I implemented a CG solver, and it works very well. But the method you mentioned seems like a good candidate as well.

This is the Monte Carlo method, but with this program you cannot see the time difference between calculating with CUDA and calculating on a normal processor:

// pi4.cpp : Defines the entry point for the console application.

#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main ()
{
  int i, punten, binnen;   // punten = number of points, binnen = points inside the circle
  float x, y;

  srand ((unsigned) time(NULL));

  printf ("Enter the number of points: ");
  fflush (stdout);
  scanf ("%d", &punten);

  binnen = 0;
  for (i = punten; i > 0; i--) {
    x = (float) rand() / (RAND_MAX + 1.0f);
    y = (float) rand() / (RAND_MAX + 1.0f);
    if (x * x + y * y < 1.0f) {
      binnen++;   // the point lies inside the quarter circle
    }
  }

  printf ("Pi is approximately %f\n", (float) binnen / punten * 4.0f);
  return 0;
}

Again, you have to do some work; it’s not magic. Here you can build an array of binnen counts, one per thread, and then collapse it. I don’t think there is a rand function in CUDA, though, but there is a way to get some kind of rand.
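One simple way to get "some kind of rand" per thread is to give every thread its own small generator state. The sketch below is plain C++ and uses a 32-bit linear congruential generator with the constants published in Numerical Recipes; `ThreadRng`, `seed_rng`, `next_uniform`, and the seeding multiplier are hypothetical choices for illustration. An LCG like this has known statistical weaknesses, so treat it as a toy rather than a recommended generator.

```cpp
#include <cstdint>

// Hypothetical per-thread generator state: one 32-bit word per thread.
struct ThreadRng {
    std::uint32_t state;
};

// Derive a different starting state for each thread index.
// The multiplier is an arbitrary odd constant chosen for this sketch.
ThreadRng seed_rng(std::uint32_t tid) {
    return ThreadRng{tid * 2654435761u};
}

// Advance the LCG (arithmetic wraps modulo 2^32) and map the top
// 24 bits to a float in [0, 1).
float next_uniform(ThreadRng& rng) {
    rng.state = rng.state * 1664525u + 1013904223u;
    return static_cast<float>(rng.state >> 8) * (1.0f / 16777216.0f);
}
```

In the Monte Carlo loop, each thread would call `next_uniform` twice per point in place of the two `rand()` calls and keep its own binnen count.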

Mhm, I will try this. I think the first program is the best one to show the effect of CUDA,
because when you run it you can see that it takes 3 or 4 seconds before the calculation stops.

If this is meant to be more than a toy project, you should look into how pi is actually computed on CPUs. Ramanujan found some very nice formulas. And I believe the Brent-Salamin algorithm is now the method of choice. It is very sequential, but it DOUBLES the number of correct digits every step. Wikipedia says that after 25 steps this algorithm has produced the first 45 million digits.
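To make the digit doubling concrete, here is a small C++ sketch of the Brent-Salamin iteration in its usual Gauss-Legendre form. It uses ordinary doubles, so it saturates at about 16 digits after three steps; reaching millions of digits would need an arbitrary-precision type instead. The function name is made up for this example.

```cpp
#include <cmath>

// Gauss-Legendre (Brent-Salamin) iteration in double precision.
// Each step depends on the previous one, so the loop is inherently sequential.
double gauss_legendre_pi(int iterations) {
    double a = 1.0;
    double b = 1.0 / std::sqrt(2.0);
    double t = 0.25;
    double p = 1.0;
    for (int n = 0; n < iterations; ++n) {
        double a_next = 0.5 * (a + b);   // arithmetic mean
        b = std::sqrt(a * b);            // geometric mean (uses the old a)
        double d = a - a_next;
        t -= p * d * d;
        p *= 2.0;
        a = a_next;
    }
    return (a + b) * (a + b) / (4.0 * t);
}
```

One step already gives about 3.1406, and three steps reach the limit of double precision.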

I think a parallel method of calculating pi that is faster than these algorithms is more a (probably very hard) number theory problem than a CUDA one.

And then there is the problem of needing arbitrary precision arithmetic…

I’d say that the parallelization could come from computing the arithmetic in a parallel fashion - there should be some opportunity for breaking up a 45 million digit wide multiplication, for instance.
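As a rough illustration of where that parallelism comes from, a big multiplication can be split into a convolution of digit arrays followed by carry propagation. In the C++ sketch below (a schoolbook version with hypothetical names, digits stored least significant first in base 10), every convolution coefficient is computed independently of the others, so that loop is the part that could be spread over many threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies two non-negative integers given as base-10 digit arrays,
// least significant digit first.
std::vector<int> multiply(const std::vector<int>& a, const std::vector<int>& b) {
    assert(!a.empty() && !b.empty());
    std::vector<long long> conv(a.size() + b.size(), 0);

    // Convolution: conv[k] = sum of a[i] * b[k - i].
    // Each k is independent of every other k.
    for (std::size_t k = 0; k < conv.size(); ++k)
        for (std::size_t i = 0; i < a.size(); ++i)
            if (i <= k && k - i < b.size())
                conv[k] += static_cast<long long>(a[i]) * b[k - i];

    // Carry propagation turns the coefficients back into single digits.
    std::vector<int> out(conv.size());
    long long carry = 0;
    for (std::size_t k = 0; k < conv.size(); ++k) {
        long long v = conv[k] + carry;
        out[k] = static_cast<int>(v % 10);
        carry = v / 10;
    }
    while (out.size() > 1 && out.back() == 0) out.pop_back();  // trim leading zeros
    return out;
}
```

For example, `multiply({5,4,3,2,1}, {9,8,7,6})` computes 12345 * 6789 and returns the digits of 83810205 in reverse order. The double loop costs quadratic time; it is only meant to show the structure.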

That’s just parallelizing the arbitrary-precision arithmetic portion. Considering the size of the numbers, you’re right, that is probably enough.

But parallelizing the actual pi computation algorithm itself is an entirely different beast.

This question came up briefly before. It seemed that the two best candidate algorithms were:

1. Using cuFFT, since a multiplication of arbitrary precision can be turned into an FFT convolution. I don’t know the details, but it seemed like a great idea. This is a major part of how pi is calculated on the CPU.

2. Using the formula that lets you compute any arbitrary hexadecimal digit of pi. Each thread does a digit. I’m not exactly sure how it works. This doesn’t get used on CPUs, but I got the feeling it wasn’t a much slower algorithm than the mainstream ones.

In terms of difficulty, I had the impression #1 was harder. C code for #2 is available and is short, and porting it to CUDA should be easy (just run one instance per thread). But someone has to double-check what the performance for #2 really is.
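The digit-extraction formula mentioned above is the Bailey-Borwein-Plouffe (BBP) formula. The following C++ sketch shows how a single hexadecimal digit can be computed without the earlier ones, which is why one digit per thread is possible. It is not the C code referred to in the post; the function names are made up here, and with plain doubles it is only reliable for moderate positions.

```cpp
#include <cmath>
#include <cstdint>

// 16^e mod m by square-and-multiply.
std::uint64_t pow16_mod(std::uint64_t e, std::uint64_t m) {
    std::uint64_t result = 1 % m;
    std::uint64_t base = 16 % m;
    while (e > 0) {
        if (e & 1) result = result * base % m;
        base = base * base % m;
        e >>= 1;
    }
    return result;
}

// Fractional part of the sum over k of 16^(d-k) / (8k + j).
double bbp_series(unsigned d, unsigned j) {
    double s = 0.0;
    for (unsigned k = 0; k <= d; ++k) {             // terms with 16^(d-k) >= 1
        std::uint64_t m = 8ull * k + j;
        s += static_cast<double>(pow16_mod(d - k, m)) / static_cast<double>(m);
        s -= std::floor(s);
    }
    double term = 1.0 / 16.0;
    for (unsigned k = d + 1; k <= d + 16; ++k) {    // rapidly shrinking tail
        s += term / (8.0 * k + j);
        term /= 16.0;
    }
    return s - std::floor(s);
}

// Hex digit of pi at position d after the point (d = 0 is the first digit).
unsigned pi_hex_digit(unsigned d) {
    double x = 4.0 * bbp_series(d, 1) - 2.0 * bbp_series(d, 4)
             - bbp_series(d, 5) - bbp_series(d, 6);
    x -= std::floor(x);
    return static_cast<unsigned>(16.0 * x);
}
```

Since pi is 3.243F6A88... in hexadecimal, `pi_hex_digit(0)` returns 2 and `pi_hex_digit(3)` returns 15 (F). The work per digit grows with the position d, so threads assigned to later digits would do more work than those assigned to early ones.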