Using CUDA, how would you efficiently implement an n-sample moving average filter for a 1D array of integers? For discussion, let’s say the array is 2^16 integers long, and that we want “n” (the window length) to be 32. I’m not used to “thinking parallel” – any tips on how to allocate threads to this problem would be appreciated.

Not sure if this is too helpful, but the BoxFilter SDK sample applies a separable moving average filter over a 2D image. (Separable means this is done separately in the horizontal and in the vertical direction). You may get some ideas from this SDK sample.

Re-arrange the 2Ë†16 pixels into a 2D image of 1024 wide, 64 high and run the horizontal box filter on it with a filter width of 32. Replicate 32 pixels from each row’s left side and append it on the right edge of the image, offset by one row. Similarly replicate the rightmost 32 pixels at the left edge of the image, creating 1088 pixels total width. That solves the overlap problem where the original box filter does not filter across rows.

The SDK sample is nicely parallel threaded and quite fast already, so you don’t have to re-invent the wheel. However you may need to modify the data format to suit your needs (integers instead of 8 bit true color RGB pixels).

Do a prefix sum on the array, call that new array s. It’s a sum table.
Then the moving average between A and B is just (s[B]-s[A])/(B-A).
If the moving average is a constant width, the compute of 1/(B-A) is precomputable, and the array lookups can be nicely coalesced (in fact automagically on G200).

If your integers values are full range of 2^32, use two sum arrays, one for the low word and one for the high world.
This would also help for wider ranges, not just higher values.