The parallel reduction example in the SDK seems to need an input array whose size is a power of 2 (unless I have missed something). Does anyone know of a way to extend it to an array of any size?

I was earlier under the impression that kernel 6 in the sample could handle an input array of arbitrary size, but I am not sure…

It is pretty trivial to get the reduction to work for any input size. Have each thread in a block accumulate a partial result in a register or in shared memory, striding through the input by the grid size (gridDim.x * blockDim.x) until it goes past the end of the input data, so there will be some divergence on the last summation pass. Then do the second phase of the reduction in shared memory with some or all of the threads in the block. The key requirement is that the number of threads per block must be a power of two if you are using the whole block to do the shared memory reduction. If you just use a single warp for that phase, then even that doesn't have to apply.
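As a rough illustration of the two phases described above, here is a minimal sketch of a sum reduction. It is not the SDK kernel. The names `partialSum`, `reduceSum`, and `BLOCK`, the cap of 64 blocks, and the choice to add up the per-block results on the host are all assumptions made for this example. Error checking is left out to keep it short.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size. It must be a power of two because the whole block
// takes part in the shared memory tree reduction below.
constexpr int BLOCK = 256;

// Hypothetical kernel: writes one partial sum per block to blockSums.
// Must be launched with exactly BLOCK threads per block.
__global__ void partialSum(const float* in, float* blockSums, size_t n)
{
    __shared__ float smem[BLOCK];

    // Phase 1: each thread accumulates in a register, striding by the
    // grid size until it runs past the end of the input. n can be any size.
    float acc = 0.0f;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += in[i];

    smem[threadIdx.x] = acc;
    __syncthreads();

    // Phase 2: tree reduction in shared memory over the whole block.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = smem[0];
}

// Hypothetical host wrapper: sums an array of arbitrary length.
float reduceSum(const std::vector<float>& h)
{
    size_t n = h.size();
    if (n == 0) return 0.0f;

    // Cap the grid (64 is an arbitrary choice); the grid-stride loop
    // covers whatever the grid does not reach in one pass.
    int grid = (int)std::min<size_t>((n + BLOCK - 1) / BLOCK, 64);

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, grid * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    partialSum<<<grid, BLOCK>>>(dIn, dPartial, n);

    // The blocking copy on the default stream waits for the kernel to finish.
    std::vector<float> partial(grid);
    cudaMemcpy(partial.data(), dPartial, grid * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    // Finish the few per-block results on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Calling `reduceSum(std::vector<float>(1000, 1.0f))` exercises a size that is not a power of two. The final pass over the per-block sums could just as well be a second kernel launch with a single block; doing it on the host only keeps the sketch small.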