One of the algorithms in the code I am porting to CUDA contains lots of summations. I have looked through the reduction and “scan” examples in the SDK and a PDF file on the subject (also from the CUDA website), but there are still a few things I didn’t quite understand.
For example, in the code I’m porting, one of the operations involves computing the Hadamard product of two matrices (C = A ∘ B, extremely simple in CUDA!) and then “collapsing” the resulting matrix into a vector by summing each column’s elements.
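For reference, the naive way I can picture this as a single fused kernel would be one thread per column, looping over the rows (a sketch only; the kernel name `colSumHadamard` and the row-major layout are my own assumptions):

```cuda
// Sketch: fuse the Hadamard product into the column sum,
// one thread per output column. Row-major storage assumed.
__global__ void colSumHadamard(const float *A, const float *B,
                               float *v, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;

    float sum = 0.0f;
    for (int row = 0; row < rows; ++row) {
        int idx = row * cols + col;   // element (row, col) in row-major order
        sum += A[idx] * B[idx];       // element-wise product, summed on the fly
    }
    v[col] = sum;                     // one scalar per column; matrix C never stored
}
```

This avoids storing C at all, but it only uses one thread per column, which seems wasteful when the matrix has few columns and many rows.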
It is the summation over the columns that is not so straightforward. The PDF document I read presented an algorithm for summation, but only for a linear array whose elements were all known up front.
In my case the elements are only known after each thread performs its own element-wise product.
The only way I could think of solving this is to store the resulting matrix and then use a separate kernel to do the column sums. But this means allocating more storage (the CPU code needs only the vector), and it means accessing matrix C’s memory twice (once to write it, once to read it back).
Is there a way I could communicate the results between threads in the same column to do the summation?
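What I have in mind is something like the SDK reduction, but per column and fused with the product, e.g. one block per column with a shared-memory tree reduction (again just a sketch with made-up names; BLOCK must be a power of two, and I’m assuming row-major storage and rows that may exceed the block size):

```cuda
#define BLOCK 256

// Sketch: one block per column; each thread strides over that column's
// rows, fusing the Hadamard product into its partial sum, then the
// block reduces the partials in shared memory (as in the SDK example).
__global__ void colSumReduce(const float *A, const float *B,
                             float *v, int rows, int cols)
{
    __shared__ float partial[BLOCK];

    int col = blockIdx.x;             // this block owns one column
    float sum = 0.0f;
    for (int row = threadIdx.x; row < rows; row += blockDim.x) {
        int idx = row * cols + col;
        sum += A[idx] * B[idx];       // product computed in-register, never stored
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction over the block's partial sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        v[col] = partial[0];          // final column sum
}
```

Is this roughly the right pattern, or is there a better way for threads in the same column to communicate their partial results?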