Hi everyone.

I am trying to write a CUDA kernel in which each thread computes a number, and then I want to find the sum of these numbers per block.

So, in the kernel code, if I have

sum += x[tx][ty] + y[tx][ty];

and then synchronize the threads (with __syncthreads()), would this give the sum of all the x and y elements for each block?

If not, what is the recommended way to code this?
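
From what I have read, a plain `sum +=` is a problem either way: if sum is a shared variable, the threads race on it and overwrite each other's updates, and if it is a local variable, each thread only ever sees its own value. __syncthreads() only makes the threads wait for each other; it does not combine anything. The pattern I keep seeing suggested instead is a tree reduction in shared memory. Below is a minimal sketch of what I think that looks like, not a definitive implementation. The names TILE, blockSumKernel and sumPerBlock are my own placeholders, and I am assuming that x and y are flat float arrays holding one TILE*TILE chunk per block and that the number of threads per block is a power of two. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile size: each block has TILE x TILE threads.
// TILE * TILE must be a power of two for the halving loop below.
constexpr int TILE = 16;

// Each thread computes one value; the block then adds the values together.
// Assumed layout: x and y hold one contiguous TILE*TILE chunk per block.
__global__ void blockSumKernel(const float* x, const float* y, float* blockSums)
{
    __shared__ float partial[TILE * TILE];

    int nThreads = blockDim.x * blockDim.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // linear index in the block
    int gid = blockIdx.x * nThreads + tid;             // index into x and y

    // Step 1: every thread stores its own number in shared memory.
    partial[tid] = x[gid] + y[gid];
    __syncthreads();

    // Step 2: halve the number of active threads each pass;
    // each active thread adds in the value from its partner.
    for (int stride = nThreads / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // outside the if, so every thread reaches it
    }

    // Step 3: thread 0 writes the block's total to global memory.
    if (tid == 0) {
        blockSums[blockIdx.x] = partial[0];
    }
}

// Host helper (hypothetical): copies the inputs to the GPU, launches one
// TILE x TILE block per chunk, and returns one sum per block.
std::vector<float> sumPerBlock(const std::vector<float>& x,
                               const std::vector<float>& y,
                               int numBlocks)
{
    std::size_t n = static_cast<std::size_t>(numBlocks) * TILE * TILE;
    assert(x.size() == n && y.size() == n);

    float *dx = nullptr, *dy = nullptr, *dSums = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dSums, numBlocks * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, dim3(TILE, TILE)>>>(dx, dy, dSums);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> sums(numBlocks);
    cudaMemcpy(sums.data(), dSums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dSums);
    return sums;
}
```

As I understand it, calling sumPerBlock(x, y, numBlocks) from host code would return one total per block. Is this the right idea, or is there a simpler way (for example atomicAdd on a shared variable)?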

Thanks