Reduction algorithm: summing 400 bytes (each in the range 0 to 255)


What is the best-performing algorithm to sum N bytes (N = 400) on a card that supports compute capability 1.3?


This is a linear programming problem :D

Check the reduction sample in the SDK.
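For an input this small, the idea behind that sample can probably be cut down to a single block. The following is only a minimal sketch, not the SDK code itself: the names `sumBytes`, `N` and `THREADS` are made up for illustration, and it assumes one thread per byte with the block padded to 512 threads (the per-block limit on compute capability 1.3). The largest possible sum is 400 * 255 = 102000, so a 32-bit accumulator is enough.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N       400   // number of input bytes, from the question
#define THREADS 512   // power of two >= N; per-block thread limit on compute capability 1.3

// Single-block tree reduction in shared memory (hypothetical kernel name).
__global__ void sumBytes(const unsigned char *in, unsigned int *out, int n)
{
    __shared__ unsigned int partial[THREADS];
    int tid = threadIdx.x;

    // Widen each byte to 32 bits; threads past the end contribute 0.
    partial[tid] = (tid < n) ? in[tid] : 0u;
    __syncthreads();

    // Halve the number of active threads at each step.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();   // outside the if, so every thread reaches it
    }

    if (tid == 0)
        *out = partial[0];
}

int main()
{
    // Test pattern plus a CPU reference sum for checking.
    unsigned char h_in[N];
    unsigned int expected = 0;
    for (int i = 0; i < N; ++i) {
        h_in[i] = (unsigned char)(i % 256);
        expected += h_in[i];
    }

    unsigned char *d_in = 0;
    unsigned int *d_out = 0;
    cudaMalloc((void **)&d_in, N);
    cudaMalloc((void **)&d_out, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, N, cudaMemcpyHostToDevice);

    sumBytes<<<1, THREADS>>>(d_in, d_out, N);

    // Copied back here only to verify; error checking omitted for brevity.
    unsigned int result = 0;
    cudaMemcpy(&result, d_out, sizeof(result), cudaMemcpyDeviceToHost);
    printf("GPU sum = %u, CPU check = %u\n", result, expected);

    cudaFree(d_in);
    cudaFree(d_out);
    return result == expected ? 0 : 1;
}
```

If the sum feeds a later kernel, the copy back to the host can be dropped and the next kernel can read `d_out` directly. Whether this beats other layouts (for example fewer threads each loading several bytes) would have to be measured on the actual card.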

I would have thought the best performing algorithm was probably a memcpy back to the host followed by a serial loop on the CPU. Who would go to the trouble of implementing a parallel reduction for 400 bytes?
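The copy-back approach described above might look roughly like this. It is a sketch under assumptions: `sumOnHost` is a hypothetical helper name, the data is assumed to already sit in device memory, and error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 400   // number of input bytes, from the question

// Copy n bytes (n <= N) back from the device and add them serially on the CPU.
unsigned int sumOnHost(const unsigned char *d_in, int n)
{
    unsigned char h_buf[N];
    cudaMemcpy(h_buf, d_in, n, cudaMemcpyDeviceToHost);   // blocking copy

    unsigned int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += h_buf[i];
    return sum;
}

int main()
{
    unsigned char *d_in = 0;
    cudaMalloc((void **)&d_in, N);
    cudaMemset(d_in, 1, N);   // stand-in data: every byte set to 1

    printf("sum = %u\n", sumOnHost(d_in, N));

    cudaFree(d_in);
    return 0;
}
```

The loop itself is trivial; the cost is the device-to-host transfer and the synchronization it forces on every call.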

The guy who needs the result in the next iteration of the algorithm's innermost loop.