finding sum

Hi everyone.

I am trying to write CUDA code in which each thread computes a number, and I then want to find the sum of these numbers per block.

So if I have this in the kernel code:
sum += x[tx][ty] + y[tx][ty];

and then synchronize the threads, would this give the sum of all the x and y elements for each block?

If not, what is the right way to code this?


That will not work: if sum is shared between the threads, multiple threads are reading and writing it simultaneously, which is a race condition and leads to undefined results.

For an example of how to do this, look at the scalarProd example or the scan example in the SDK. In particular, the scan whitepaper has a well-written description; you only need the up-sweep phase.

There is also reduction code in this post:…l=sum+reduction
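
To make the idea concrete, here is a minimal sketch of a per-block tree reduction in shared memory. It is not the SDK's code, just an illustration under some assumptions: x and y are flat float arrays in global memory, blocks are TILE x TILE threads with a power-of-two thread count, the grid is one-dimensional, and the names blockSumKernel, sumPerBlock and TILE are made up for this example. Each thread first writes its own value into its own shared-memory slot, so nothing is written by two threads at once, and then the number of active threads is halved on every pass.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed block edge; TILE * TILE must be a power of two

// Each block adds up x[i] + y[i] over its own threads and writes one result.
__global__ void blockSumKernel(const float *x, const float *y, float *blockSums)
{
    __shared__ float partial[TILE * TILE];

    unsigned int threadsPerBlock = blockDim.x * blockDim.y;
    unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;  // index within the block
    unsigned int gid = blockIdx.x * threadsPerBlock + tid;      // index into the flat input

    // Step 1: every thread stores its value in a private slot, so there is no race.
    partial[tid] = x[gid] + y[gid];
    __syncthreads();

    // Step 2: tree reduction, halving the number of active threads each pass.
    for (unsigned int stride = threadsPerBlock / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();  // outside the if, so every thread in the block reaches it
    }

    // Step 3: thread 0 holds the block's total.
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: x and y must each hold numBlocks * TILE * TILE elements.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<float> sumPerBlock(const std::vector<float> &x,
                               const std::vector<float> &y,
                               int numBlocks)
{
    size_t n = (size_t)numBlocks * TILE * TILE;
    assert(x.size() == n && y.size() == n);

    float *dx = nullptr, *dy = nullptr, *dSums = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dSums, numBlocks * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, dim3(TILE, TILE)>>>(dx, dy, dSums);

    // This copy waits for the kernel to finish before reading the results.
    std::vector<float> sums(numBlocks);
    cudaMemcpy(sums.data(), dSums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dSums);
    return sums;
}
```

Calling sumPerBlock(x, y, numBlocks) from the host returns one sum per block. If you need a single total over all blocks, you would still have to add those per-block results together, either on the host or in a second kernel launch.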