Taking the sum of n floating-point numbers

A quick and easy way to do this is to use cublasSgemm to multiply your data by a ones vector (a vector whose elements are all 1.0f) of the same length as your data. You'll probably have to write a trivial kernel to initialize the ones vector, but the call to Sgemm is fairly straightforward; just be careful to get the input dimensions right. My guess is that using cuBLAS would be slower than the reduction example, but it would be interesting to see by how much.
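Here is a rough sketch of what I mean, in case the dimensions are the confusing part. It assumes the handle-based cuBLAS API from cublas_v2.h (the older API takes character flags and no handle), column-major storage, and count > 0. The names fill_ones, sum_with_sgemm, CUDA_CHECK and CUBLAS_CHECK are just ones I made up for illustration. The data is treated as a 1 x count matrix and the ones vector as a count x 1 matrix, so the product is a 1 x 1 matrix holding the sum.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical helper: abort if a cuBLAS call fails.
#define CUBLAS_CHECK(call)                                                \
    do {                                                                  \
        if ((call) != CUBLAS_STATUS_SUCCESS) {                            \
            std::fprintf(stderr, "cuBLAS call failed\n");                 \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Trivial kernel that sets every element of v to 1.0f.
__global__ void fill_ones(float* v, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) v[i] = 1.0f;
}

// Sums `count` host floats by computing (1 x count) * (count x 1) on the GPU.
// Assumes count > 0.
float sum_with_sgemm(const float* h_data, int count) {
    float *d_data = nullptr, *d_ones = nullptr, *d_sum = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, count * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_ones, count * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_sum, sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_data, h_data, count * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Initialize the ones vector on the device.
    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    fill_ones<<<blocks, threads>>>(d_ones, count);
    CUDA_CHECK(cudaGetLastError());

    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));

    // C = alpha * A * B + beta * C, all column-major:
    //   A = data, m x k = 1 x count,  lda = 1
    //   B = ones, k x n = count x 1,  ldb = count
    //   C = sum,  m x n = 1 x 1,      ldc = 1
    const float alpha = 1.0f, beta = 0.0f;
    CUBLAS_CHECK(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                             1, 1, count,
                             &alpha, d_data, 1,
                             d_ones, count,
                             &beta, d_sum, 1));

    // The blocking copy waits for the Sgemm on the default stream to finish.
    float sum = 0.0f;
    CUDA_CHECK(cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost));

    CUBLAS_CHECK(cublasDestroy(handle));
    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFree(d_ones));
    CUDA_CHECK(cudaFree(d_sum));
    return sum;
}
```

You would call it from host code as sum_with_sgemm(host_ptr, count) and link against cuBLAS (for example, nvcc sum.cu -lcublas). Note that this version copies the data and creates the handle on every call, so for a fair timing comparison against the reduction example you would want to keep the data on the device and time only the Sgemm call. Since the output is a single element, the same product could also be written as a dot product with cublasSdot.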