I hope it is the right place to ask about possible algorithms. I have got a big array of numbers (possibly more than 1 million elements) on my device and I want to perform a special summation on the elements contained in the array.
I’d like the cell i of my array to contain the sum of cells j with j<i.

array[i] = sum_{j<i} array[j]

And this should be done for all cells.

Do you know any smart parallel algorithm to achieve this ? I cannot think of any strategy for that.

I’d probably do blockwise sums first (maybe subpartitioning multiple times), so that a subsequent kernel can quickly generate the start value for each block.

Alternatively, you can do a FFT on your data, multiply by 1/frequency, and transform back.

I’d probably do blockwise sums first (maybe subpartitioning multiple times), so that a subsequent kernel can quickly generate the start value for each block.

Alternatively, you can do a FFT on your data, multiply by 1/frequency, and transform back.

Unless I am misunderstanding something, that is just a standard prefix sum. There are “off the shelf” implementations in both cudpp and thrust that you can use for this without needing to write a single line of device code.

Unless I am misunderstanding something, that is just a standard prefix sum. There are “off the shelf” implementations in both cudpp and thrust that you can use for this without needing to write a single line of device code.