How to aggregate series on CUDA?

Hi all,
I am trying to think of an effective way to aggregate a series of int values on CUDA.

On input: 1440*365*10 = 5256000 int words.

On output: 365*10 = 3650 float words, each word being the average of 1440 int values.

On the one hand, it is not just a single reduction, so I can't see a straightforward way to apply the multiple adds per thread approach from the SDK reduction example.

On the other hand, many of the 1440-element groups to be averaged will not align to a 256 address boundary, so a reduction within each group doesn't sound efficient either.

I was thinking about atomic operations but have not tried them yet.

Reading through the texture fetch units also doesn't sound effective: the data are not reused, so no caching is needed, and no wrapping or normalizing logic is required either.

Could you please advise something?

The first stage of the classic parallel reduction is perfect for that application: you effectively want to reduce each of the 3650 subsets of 1440 values into a sum and then compute a mean from that sum. That can be done easily with one block per subset. You can see a similar application of the reduction to column summation of a matrix here.
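To make the one-block-per-subset idea concrete, here is a minimal sketch, not a tuned implementation. The names `groupMeanKernel`, `groupMeans` and `kThreads` are made up for illustration, 256 threads per block is an assumption, and CUDA error checking is left out for brevity. Each thread first adds up a strided slice of its group (the "multiple adds per thread" step), then the block combines the per-thread sums in shared memory. As for alignment, 1440 ints are 5760 bytes, which is a multiple of 128 bytes, so with a `cudaMalloc`'d input buffer every group should start on a 128-byte boundary.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed block size for this sketch; the tree reduction needs a power of two.
constexpr int kThreads = 256;

// One block per group: blockIdx.x selects the group, and the block writes
// that group's mean to out[blockIdx.x]. Must be launched with kThreads threads.
__global__ void groupMeanKernel(const int* in, float* out, int groupSize)
{
    __shared__ long long partial[kThreads];

    const int* group = in + (size_t)blockIdx.x * groupSize;

    // Stage 1: each thread sums a strided slice of the group, so neighbouring
    // threads read neighbouring addresses.
    long long sum = 0;
    for (int i = threadIdx.x; i < groupSize; i += blockDim.x)
        sum += group[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Stage 2: tree reduction of the per-thread sums in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 turns the group sum into a mean.
    if (threadIdx.x == 0)
        out[blockIdx.x] = (float)((double)partial[0] / groupSize);
}

// Hypothetical host helper: copies the data to the device, launches one block
// per complete group, and returns the means. Error checks omitted.
std::vector<float> groupMeans(const std::vector<int>& values, int groupSize)
{
    const int numGroups = (int)(values.size() / groupSize);
    std::vector<float> means(numGroups);
    if (numGroups == 0)
        return means;

    int* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, values.size() * sizeof(int));
    cudaMalloc(&dOut, numGroups * sizeof(float));
    cudaMemcpy(dIn, values.data(), values.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    groupMeanKernel<<<numGroups, kThreads>>>(dIn, dOut, groupSize);

    // This copy runs after the kernel on the default stream, so no extra
    // synchronization is needed before reading the results.
    cudaMemcpy(means.data(), dOut, numGroups * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return means;
}
```

For the sizes in the question, this would be called as `groupMeans(values, 1440)` on a vector of 5256000 ints, which launches 3650 blocks. No atomics are needed, because each output word is written by exactly one block.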

Many thanks, will have a look.