 # Accumulate value within block

I am trying to compute a series sum within a thread block: Sum = a1 + a2 + a3 + … + aN,
and then return the sum to an output variable. Each thread computes a single term of the series.

What is the best way to do this? Is the following kernel code ok?

```cuda
__global__ void series_sum(int *output)
{
    __shared__ int sum;
    int aN;

    /* compute aN using the threadIdx */
    sum += aN;
    __syncthreads(); // wait for sum to accumulate

    *output = sum;
}
```

That code won’t work correctly: you have a race on the shared memory accumulator. You will need either shared memory atomic functions (which effectively serialize access) or an in-shared-memory parallel reduction. The latter is preferable for anything other than a trivial computation.
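For reference, here is a minimal sketch of the in-shared-memory reduction. It assumes a power-of-two block size, and the term computation is just a placeholder (each thread contributes `threadIdx.x + 1`); substitute your real series term there.

```cuda
#define BLOCK_SIZE 256  // assumed block size; must be a power of two

__global__ void series_sum(int *output)
{
    __shared__ int terms[BLOCK_SIZE];

    // Placeholder term: replace with the real computation based on threadIdx
    int aN = threadIdx.x + 1;

    terms[threadIdx.x] = aN;
    __syncthreads(); // every term must be written before the reduction starts

    // Tree reduction: halve the number of active threads each step, O(log N) steps
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            terms[threadIdx.x] += terms[threadIdx.x + stride];
        __syncthreads(); // all adds at this stride must finish before the next
    }

    if (threadIdx.x == 0)
        *output = terms[0]; // thread 0 holds the final sum
}
```

The atomic alternative would be to have one thread zero `sum`, call `__syncthreads()`, then have every thread do `atomicAdd(&sum, aN)` followed by another `__syncthreads()` before reading the result; it is simpler but serializes all threads on one memory location.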

Yes, I see the race condition. Thanks. No wonder I was getting unreproducible sums. I think I will perform a parallel reduction, since it is O(log N).

I have this race-condition problem too. Could you please explain what is meant by a parallel reduction?

Here is a great document by Mark Harris: http://www.cs.bham.ac.uk/~drg/cuda/reduction.pdf

Thank you very much.

This post helped me out. At the very end it has code showing a parallel reduction:
http://forums.nvidia.com/lofiversion/index.php?t167370.html

I posted some code on this before: http://forums.nvidia.com/lofiversion/index.php?t177324.html
