I need to add up the elements of each row of an n x n matrix.
The matrix would be split across different blocks, so the elements
of a row wouldn’t all be in the same block. How can I add the different
parts of a row if I can’t synchronize across different blocks?
I’m afraid the only way to synchronize across blocks is to use separate kernel launches. Use an intermediate array to store the partial results, that is, the sum of the part of each row that each block handles (one entry per row per block, so n entries only once a single block covers a whole row). Then run the kernel again on those partial sums, and keep reducing until what is left of each row can be processed by one block. The launch in which a single block handles each row produces the final row sums.
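A rough sketch of that multi-launch reduction, under some assumptions that are mine and not from the question: the matrix is stored row-major as `float`, n is at least 1 and small enough to be used as `gridDim.y`, and the names `partialRowSums`, `rowSums` and `TILE` are made up for illustration. Each block sums one `TILE`-wide piece of one row in shared memory; the host loop relaunches the kernel on the partial sums until each row has a single value left. Kernel launches on the same stream run in order, which is what provides the synchronization between blocks. Error checking is left out for brevity.

```cuda
#include <cstddef>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block; assumed to be a power of two

// Each block sums one TILE-wide segment of one row.
// `in` is rows x cols (row-major); `out` is rows x gridDim.x (row-major).
__global__ void partialRowSums(const float* in, float* out, int cols) {
    __shared__ float buf[TILE];
    int row = blockIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Load one element per thread, padding with 0 past the end of the row.
    buf[threadIdx.x] = (col < cols) ? in[(size_t)row * cols + col] : 0.0f;
    __syncthreads();

    // Tree reduction inside the block (synchronization within a block is fine).
    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 stores this block's partial sum for its row.
    if (threadIdx.x == 0) out[(size_t)row * gridDim.x + blockIdx.x] = buf[0];
}

// Hypothetical host helper: returns the n row sums of an n x n matrix.
std::vector<float> rowSums(const std::vector<float>& host, int n) {
    int cols = n;
    int tiles = (cols + TILE - 1) / TILE;

    // Two buffers used in ping-pong fashion: input of one pass, output of the next.
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, (size_t)n * n * sizeof(float));
    cudaMalloc(&b, (size_t)n * tiles * sizeof(float));
    cudaMemcpy(a, host.data(), (size_t)n * n * sizeof(float),
               cudaMemcpyHostToDevice);

    float* src = a;
    float* dst = b;
    do {
        tiles = (cols + TILE - 1) / TILE;
        dim3 grid(tiles, n);  // x: segments of a row, y: rows
        partialRowSums<<<grid, TILE>>>(src, dst, cols);
        std::swap(src, dst);  // this pass's output is the next pass's input
        cols = tiles;         // each row now has `tiles` partial sums left
    } while (cols > 1);

    // After the final swap, `src` holds one value per row.
    std::vector<float> sums(n);
    cudaMemcpy(sums.data(), src, (size_t)n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return sums;
}
```

Calling `rowSums(m, n)` with `m` holding the n*n matrix values returns a vector of n sums. For n up to 256 the loop runs once; for larger n it takes one extra launch per factor of 256. The last `cudaMemcpy` blocks until the preceding kernels have finished, so no explicit `cudaDeviceSynchronize()` is needed here.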