I have been trying to covert my Factorial of 4 program.c into Cuda Factorial program.But I am unable to do so. Following is my Cprogram(Factorialof4.c).I wrote a cuda program for it(see Attachment)But it gives wrong output.

Tell Me How can I convert this program in Cuda?

Thanks in Advance

#include<stdio.h>
int Fact(int *a,int n)
{
int fact,i;
fact =1;
for(i=0;i<n;i++)
{
fact=fact*(i+1);
}
a=&fact;
return(*a);
}
int main()
{
int *a,n,factorial;
n=4;
factorial=Fact(a,n);
printf("Factorial of %d=%d",n,factorial);
return 0;
}

Now within each thread block, let’s say you’ve allocated 256 threads and el_per_block is 2048.

You then let each thread handle 8 muliplications, thread one goes from 0,1,2,…,7 and thread2 handles elements 8,9…15 etc,…

The next step is to do the reduction that sarnath mentioned which means you are multiplying and storing all the values that each thread has computed.

So if thread one did:

val1 = (i)(i-1)(i-2)(i-3)…(i-7)

And thread 2 did:

val2 = (i-8)…(i-15)

You would in the reduction step be multiplying val1*val2.

After the reduction is finished that thread block is finished doing its factorial operations on those 2048 elements and the answer is written back to global memory. Meanwhile you’ve had maybe another 511 blocks that have done the same computation since your N = 512*2048 = 1048576.

This means that your final step is to run kernel number two which multiplies and stores all the values that you’ve already computed in kernel number one.

Nvidia has some great examples of reduction code and i think they put up good slides of it somewhere.

Just for general awareness of the usefulness of CUDA - this is probably not a good application. The largest factorial that fits in a 32 bit integer is 12, the largest that fits in a 64 bit integer is 20. Twenty integer multiplies take almost no time on a cpu and will definitely not see any benefit to being performed with CUDA; the total amount of parallelism would be very limited. For using CUDA to make sense you would have to be calculating the factorials of large numbers and to do that you would need a way of handling arbitrary length integers.

If you just want practice writing reductions by hand, I would recommend using an operator like + or max instead of multiply since it would be much harder to overflow the result.

Finding a factorial of a number using cuda is not giving you any advantage. This is a sequential problem. Cuda is meant for data parallelization, 1000s of cuda cores help you to manage parallel threads but finding a factorial is to expect the dependency of one iteration of multiplication on the previous one.