I am trying to write a CUDA kernel in which each thread computes a number, and I then want to find the sum of these numbers per block.
So, in the kernel code, if I have
sum += x[tx][ty] + y[tx][ty];
and then synchronize the threads, would this give the sum of all the x and y elements for each block?
If not, what would be a possible way to code this?
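For context, my current understanding is that the line above does not work by itself: if sum is an ordinary local variable, every thread has its own private copy, and if it is a shared variable, the unsynchronized += is a race between threads. __syncthreads() is only a barrier and does not combine values. The pattern I have seen suggested is a tree reduction in shared memory. Below is a minimal sketch of that idea, not a definitive implementation. The names blockSum, runBlockSum, TILE and partial are hypothetical, and it assumes x and y are flat row-major float arrays in global memory and that each block has TILE x TILE threads (a power of two in total).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed block edge; TILE * TILE must be a power of two

// Each thread computes one value (x + y at its position), then the block
// cooperatively sums all values with a shared-memory tree reduction.
__global__ void blockSum(const float* x, const float* y, float* blockSums,
                         int width, int height)
{
    __shared__ float partial[TILE * TILE];  // one slot per thread in the block

    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;
    int row = blockIdx.y * TILE + ty;
    int tid = ty * TILE + tx;               // flat thread index within the block

    // Out-of-range threads contribute 0 so the reduction stays correct.
    float v = 0.0f;
    if (col < width && row < height)
        v = x[row * width + col] + y[row * width + col];
    partial[tid] = v;
    __syncthreads();                        // all values written before anyone reads

    // Halve the number of active threads each step; every step adds pairs.
    for (int stride = (TILE * TILE) / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // barrier is outside the if, so all threads reach it
    }

    // Thread 0 now holds the block's total and writes it out.
    if (tid == 0)
        blockSums[blockIdx.y * gridDim.x + blockIdx.x] = partial[0];
}

// Hypothetical host helper: fills x with 1 and y with 2, returns one sum per block.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<float> runBlockSum(int width, int height)
{
    size_t n = (size_t)width * height;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    size_t numBlocks = (size_t)grid.x * grid.y;
    std::vector<float> hsums(numBlocks, 0.0f);

    float *dx = nullptr, *dy = nullptr, *dsums = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dsums, numBlocks * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<grid, block>>>(dx, dy, dsums, width, height);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hsums.data(), dsums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dsums);
    return hsums;
}

int main()
{
    // 32 x 32 input with 16 x 16 blocks gives 4 blocks of 256 threads each.
    std::vector<float> sums = runBlockSum(32, 32);
    for (size_t b = 0; b < sums.size(); ++b)
        printf("block %zu sum = %.1f\n", b, sums[b]);  // each block: 256 * (1 + 2) = 768.0
    return 0;
}
```

As far as I can tell, atomicAdd on a single shared variable would also give the right per-block sum with less code, but it serializes the additions, which is why the reduction loop seems to be the usual suggestion.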