I want to sum all the elements of a matrix that is stored in shared memory and has size warp size × warp size.
Is there an efficient way to do this?
I was thinking of a tree-like reduction: in the first pass, half of the threads each add two elements and write the result back; sync; then a quarter of the threads each add two elements; sync; and so on until only one element is left. Is this a good idea?
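To make the idea concrete, here is a rough sketch of the scheme I have in mind. It rests on a few assumptions: a warp size of 32 (so 1024 elements), a single block of 1024 threads, and the matrix being copied from global memory into a flat shared array just for the demo. The names `sumTile`, `sumTileHost`, `kWarp` and `kElems` are made up for illustration, and error checking is left out.

```cuda
#include <cuda_runtime.h>

constexpr int kWarp  = 32;             // assumed warp size
constexpr int kElems = kWarp * kWarp;  // 1024 elements in the matrix

// Tree reduction over a kWarp x kWarp matrix held in shared memory.
// Expects to be launched with one block of kElems threads.
__global__ void sumTile(const float* in, float* out)
{
    __shared__ float tile[kElems];
    const int tid = threadIdx.x;

    // Stand-in for however the matrix really ends up in shared memory.
    tile[tid] = in[tid];
    __syncthreads();

    // Each pass halves the number of active threads: 512, 256, ..., 1.
    for (int stride = kElems / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();  // kept outside the branch so every thread reaches it
    }

    if (tid == 0)
        *out = tile[0];
}

// Hypothetical host helper: copies kElems floats to the device,
// runs the kernel once and returns the sum.
float sumTileHost(const float* hostData)
{
    float* dIn  = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, kElems * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hostData, kElems * sizeof(float), cudaMemcpyHostToDevice);

    sumTile<<<1, kElems>>>(dIn, dOut);

    float result = 0.0f;
    // This copy is issued after the kernel on the default stream and
    // blocks the host until the result is available.
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling `sumTileHost` on an array of 1024 ones should give 1024. The matrix is 1024 elements, so this takes ten passes, each followed by a sync.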
Thanks in advance!