Accumulate value within block

Moiz_Ahmad · October 14, 2010, 8:03am

I am trying to compute a series sum within a thread block, Sum = a1 + a2 + a3 + … + aN, and then return the sum to an output variable. Each thread has to compute a single term in the series.

What is the best way to do this? Is the following kernel code ok?

__global__ void series_sum (int* output) {

    __shared__ int sum;
    int aN;

    /* compute aN using the threadIdx */
    sum += aN;
    __syncthreads(); // wait for sum to accumulate

    *output = sum;
}

avidday · October 14, 2010, 12:39pm

That code won’t work correctly: you have a race on the shared memory accumulator, because every thread in the block reads and writes sum at the same time. You will need to either use shared memory atomic functions (which will effectively serialize access), or do a parallel reduction in shared memory. The second is preferable if this is anything other than a trivial computation.
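
For illustration, here is a minimal sketch of the first option, the shared memory atomic. It is not the original poster's code: the `term` function (a_n = n) and the host helper `series_sum_atomic_host` are hypothetical placeholders, and it assumes a single block on a device that supports shared memory atomics (compute capability 1.2 or later).

```cuda
#include <cuda_runtime.h>

// Hypothetical series term, a_n = n. The thread never says what the series is.
__device__ int term(int n) { return n; }

__global__ void series_sum_atomic(int *output)
{
    __shared__ int sum;

    // Shared memory is not zero-initialized, so one thread clears the accumulator.
    if (threadIdx.x == 0) sum = 0;
    __syncthreads();

    int aN = term(threadIdx.x + 1);
    atomicAdd(&sum, aN);   // additions are serialized, but there is no race
    __syncthreads();       // wait until every thread has added its term

    // A single thread writes the block's result.
    if (threadIdx.x == 0) *output = sum;
}

// Hypothetical host helper: launch one block of n threads and return its sum.
int series_sum_atomic_host(int n)
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    series_sum_atomic<<<1, n>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Note the two changes relative to the kernel in the question: the accumulator is explicitly zeroed before use, and only one thread writes to the output.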

Moiz_Ahmad · October 14, 2010, 5:12pm

Yes, I see the race condition. Thanks. No wonder I was getting unreproducible sums. I think I will perform a parallel reduction, since it needs only O(log N) steps.
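
A rough sketch of what such a block-level reduction could look like, assuming one block whose size is a power of two. The names `BLOCK_SIZE`, `term` (a_n = n) and `series_sum_reduce_host` are made up for the example and do not come from any of the posts here.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed power of two; the halving loop relies on it

// Hypothetical series term, a_n = n.
__device__ int term(int n) { return n; }

__global__ void series_sum_reduce(int *output)
{
    __shared__ int partial[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    // Each thread stores its own term in its own slot, so there is no conflict.
    partial[tid] = term(tid + 1);
    __syncthreads();

    // Every pass halves the number of active threads: log2(BLOCK_SIZE) passes.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();   // kept outside the if so all threads reach the barrier
    }

    // After the last pass the total sits in slot 0.
    if (tid == 0) *output = partial[0];
}

// Hypothetical host helper: launch one block and return its sum.
int series_sum_reduce_host()
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    series_sum_reduce<<<1, BLOCK_SIZE>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With 256 threads the loop runs 8 passes, compared with 256 serialized additions in the atomic approach. This is only the basic form of the technique and leaves out further optimizations such as unrolling the last warp.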

kamsandh · October 16, 2010, 1:24pm

I too have this problem with race conditions. Could you please explain what is meant by parallel reduction?

Jimmy_Pettersson · October 16, 2010, 3:41pm

Here is a great document by Mark Harris: http://www.cs.bham.ac.uk/~drg/cuda/reduction.pdf

kamsandh · October 16, 2010, 3:55pm

Thank you very much.

Moiz_Ahmad · October 16, 2010, 4:45pm

This post helped me out. At the very end it has code showing parallel reduction:
http://forums.nvidia.com/lofiversion/index.php?t167370.html

Jimmy_Pettersson · October 16, 2010, 5:17pm

I posted some code on this before: http://forums.nvidia.com/lofiversion/index.php?t177324.html
