I am trying to compute a series sum within a thread block, Sum = a1 + a2 + a3 + … + aN, and then return the sum to an output variable. Each thread has to compute a single term in the series.

What is the best way to do this? Is the following kernel code ok?

__global__ void series_sum (int* output) {
    __shared__ int sum;
    int aN;
    /* compute aN using the threadIdx */
    sum += aN;
    __syncthreads(); //wait for sum to accumulate
    *output = sum;
}
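For comparison, here is a minimal sketch of one common alternative, a shared-memory tree reduction. In the kernel above, every thread does a read-modify-write on the same shared variable without atomics, and that variable is never initialized, so the result is not well defined. The sketch assumes a single block whose size is a power of two. `BLOCK_SIZE`, `term()` and `run_series_sum()` are hypothetical names introduced only for illustration, and `term()` is a placeholder for whatever each thread really computes.

```cuda
#include <cuda_runtime.h>

// Assumption for this sketch: one block, size is a power of two.
constexpr int BLOCK_SIZE = 256;

// Hypothetical stand-in for "compute aN using the threadIdx":
// thread i contributes the term i + 1.
__device__ int term(int i) { return i + 1; }

__global__ void series_sum(int *output) {
    // One shared slot per thread, so no two threads write the same address.
    __shared__ int partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    partial[tid] = term(tid);
    __syncthreads();  // all terms are stored before any thread reads them

    // Tree reduction: each step folds the upper half onto the lower half.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // outside the branch, so every thread reaches it
    }

    // A single thread writes the block's result.
    if (tid == 0) {
        *output = partial[0];
    }
}

// Hypothetical host helper: launches one block and returns the sum.
// Error checking is omitted to keep the sketch short.
int run_series_sum() {
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    series_sum<<<1, BLOCK_SIZE>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With the placeholder `term()`, `run_series_sum()` should return 1 + 2 + … + 256 = 32896. A simpler variant would keep the single shared `sum`, have one thread set it to zero before a `__syncthreads()`, and accumulate with `atomicAdd(&sum, aN)`. That serializes the additions, which may be acceptable for small blocks.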