About implementing an efficient reduction

Mark Harris’ scan example computes all-prefix-sums. If I only need a reduction (the sum of the outputs from all threads), is the method described in the scan example still the way to go, or is there another example of an efficient reduction?


Look at the reduction example in the SDK. There is also a paper online; check the examples section at http://www.nvidia.com/object/cuda_home.html

The reduction example is used to show how to optimize your code, and how large a speedup careful tuning can achieve.
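
To give a rough idea of the approach, here is a minimal sketch of a shared-memory tree reduction. It is not the SDK code itself, just an illustration of two commonly used ideas (adding two elements while loading, and sequential addressing in shared memory). The names `blockSum` and `reduceSum` are made up for this post, the block size of 256 is an arbitrary power of two, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block sums a slice of `in` and writes one partial sum to `out`.
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;

    // First add during load: each thread reads up to two elements.
    float v = 0.0f;
    if (i < n) v = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory; the stride halves every step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 holds the sum for this block.
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: one kernel pass, then the per-block
// partial sums are added on the CPU.
float reduceSum(const float* hIn, int n)
{
    if (n <= 0) return 0.0f;
    const int threads = 256;  // arbitrary power of two
    const int blocks = (n + threads * 2 - 1) / (threads * 2);

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Calling `reduceSum(data, n)` on a host array returns the sum of its `n` elements. For large inputs you would probably run the kernel again on the partial sums instead of finishing on the CPU, and there are further tuning steps beyond what this sketch shows.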

have a look at:


for very efficient reduction code

Where exactly? I did not see it in the forum, and browsing the source I could not find any reduction code. Can you give me a hint? I use a lot of reductions and would like to see if I can squeeze out some more performance.

Actually, the current release of CUDPP doesn’t include straight reductions, just prefix sums (scan). I believe the next release will include reductions.