How can I Convert my Factorial c program in Cuda , it working good in c,but I would like to make thi

Hi all,

I have been trying to covert my Factorial of 4 program.c into Cuda Factorial program.But I am unable to do so. Following is my Cprogram(Factorialof4.c).I wrote a cuda program for it(see Attachment)But it gives wrong output.

Tell Me How can I convert this program in Cuda?

Thanks in Advance


int Fact(int *a,int n)


	int fact,i;

	fact =1;








int main()


	int *a,n,factorial;



	printf("Factorial of %d=%d",n,factorial);

	return 0;

} (459 Bytes)

Since multiplication is an associative operation, you could use “reduction”.

Check out the “Thrust” SDK from NVIDIA. That can make your job trivial… But yeah, I think you should code it urself for the fun.

Just from the top of my head:

You have N elements, let each thread block handle

el_per_block = N/NUM_blocks

Now within each thread block, let’s say you’ve allocated 256 threads and el_per_block is 2048.

You then let each thread handle 8 muliplications, thread one goes from 0,1,2,…,7 and thread2 handles elements 8,9…15 etc,…

The next step is to do the reduction that sarnath mentioned which means you are multiplying and storing all the values that each thread has computed.

So if thread one did:

val1 = (i)(i-1)(i-2)(i-3)…(i-7)

And thread 2 did:

val2 = (i-8)…(i-15)

You would in the reduction step be multiplying val1*val2.

After the reduction is finished that thread block is finished doing its factorial operations on those 2048 elements and the answer is written back to global memory. Meanwhile you’ve had maybe another 511 blocks that have done the same computation since your N = 512*2048 = 1048576.

This means that your final step is to run kernel number two which multiplies and stores all the values that you’ve already computed in kernel number one.

Nvidia has some great examples of reduction code and i think they put up good slides of it somewhere.

I hope this was more helpful than confusing :)

Just for general awareness of the usefulness of CUDA - this is probably not a good application. The largest factorial that fits in a 32 bit integer is 12, the largest that fits in a 64 bit integer is 20. Twenty integer multiplies take almost no time on a cpu and will definitely not see any benefit to being performed with CUDA; the total amount of parallelism would be very limited. For using CUDA to make sense you would have to be calculating the factorials of large numbers and to do that you would need a way of handling arbitrary length integers.

If you just want practice writing reductions by hand, I would recommend using an operator like + or max instead of multiply since it would be much harder to overflow the result.

Brilliant catch!

Agreed :)

Finding a factorial of a number using cuda is not giving you any advantage. This is a sequential problem. Cuda is meant for data parallelization, 1000s of cuda cores help you to manage parallel threads but finding a factorial is to expect the dependency of one iteration of multiplication on the previous one.

wow, reviving a 9 year old thread.

concerning the statement that it’s a sequential problem. this is wrong

Above statement is true.

Consider the computation of factorial(9): 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9

first step
compute 23 on thread 0 -> tmp 0
compute 4
5 on thread 1 -> tmp 1
compute 67 on thread 2 -> tmp 2
compute 8
9 on thread 3 -> tmp 3

second step
multiply tmp0 and tmp1 on thread 0
multiply tmp2 and tmp3 on thread 1

third step
for the final result multiply both intermediate results from step 2

This is called a parallel reduction. It takes ceil(log2( size of factorial - 1 )) steps.

You’d need to use a GPU bignum library to calculate really large factorials. And this indeed is an area where GPUs can really shine.

let me recommend nvidia lab’s xmp library which uses CUDA.