I am trying to compute a series sum within a thread block, Sum = a1 + a2 + a3 + … + aN, and then return the sum to an output variable. Each thread has to compute a single term in the series.
What is the best way to do this? Is the following kernel code ok?
__global__ void series_sum (int* output) {
__shared__ int sum;
int aN;
int *output;
/* compute aN using the threadIdx */
sum += aN;
__syncthreads(); //wait for sum to accumulate
That code won’t work correctly: you have a race on the shared memory accumulator, because every thread in the block does an unsynchronized read-modify-write on sum. You will need to either use shared memory atomic functions (which will effectively serialize access) or do a parallel reduction in shared memory. The second is preferable if this is anything other than a trivial computation.
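For illustration, here is a rough sketch of both options for a single block. It is not the poster's actual code: the term() function (a_i = i + 1), the BLOCK_SIZE of 256, and the run_series_sum host helper are all assumptions made up for this example, since the original post does not say how aN is computed. Note that both kernels also initialize their shared memory explicitly, because shared variables do not start at zero.

```cuda
#include <cuda_runtime.h>

// Assumed block size; the tree reduction below needs a power of two.
#define BLOCK_SIZE 256

// Hypothetical term function (not from the original post): a_i = i + 1.
__device__ int term(int i) { return i + 1; }

// Option 1: shared-memory atomics. Correct, but the adds are serialized.
__global__ void series_sum_atomic(int *output) {
    __shared__ int sum;
    if (threadIdx.x == 0) sum = 0;   // shared memory is not zero-initialized
    __syncthreads();                 // make the initial value visible to all threads

    atomicAdd(&sum, term(threadIdx.x));
    __syncthreads();                 // wait until every thread has added its term

    if (threadIdx.x == 0) *output = sum;
}

// Option 2: tree reduction in shared memory (log2(BLOCK_SIZE) steps).
// Assumes the kernel is launched with exactly BLOCK_SIZE threads.
__global__ void series_sum_reduce(int *output) {
    __shared__ int partial[BLOCK_SIZE];
    partial[threadIdx.x] = term(threadIdx.x);
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();             // outside the if, so every thread reaches it
    }

    if (threadIdx.x == 0) *output = partial[0];
}

// Hypothetical host helper: launches one block and copies the result back.
int run_series_sum(bool use_reduction) {
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    if (use_reduction)
        series_sum_reduce<<<1, BLOCK_SIZE>>>(d_out);
    else
        series_sum_atomic<<<1, BLOCK_SIZE>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);  // blocks until the kernel finishes
    cudaFree(d_out);
    return h_out;
}
```

With these assumptions both variants should return 256 * 257 / 2 = 32896. The atomic version is shorter and fine for a quick test; the reduction does the same work in 8 synchronized steps instead of 256 serialized adds, which is why it tends to be the better choice once the per-thread work is non-trivial.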