Hi guys,

can anyone suggest the most efficient way to sum n floating point numbers on cuda?

thanks

-asher

Can you give an order of magnitude size for n?

Well, I can’t think of anything better than the reduction example.

n around 64K…

CUDPP also says that it has parallel reduction code which you can call:

http://www.gpgpu.org/developer/cudpp/

(I can only find parallel prefix scan functions in the documentation, but that is a more general case of parallel reduction, so maybe that is what they mean.)

I second Denis’s suggestion. Modify the reduction sample.

I think no modification is even needed; just use the reduction example code with the right N. I believe the reduction in CUDPP is (or will be) exactly the same as in the example.

Except the reduction sample works on integers, not floats.

Replace the input int with float; it’s easy…

Which is why I said to modify the reduction sample.

Modifying the input type isn’t enough, though. You must modify the shmem declaration and you may need to add code to handle non-power-of-two data sizes.
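For reference, here is a minimal sketch of what those modifications might look like, based on the simplest variant of the SDK reduction kernel (the kernel name and launch details are mine, not from the sample):

```cuda
// Sketch: SDK-style reduction adapted to float, with a bounds check
// so n need not be a power of two. Launch with blockDim.x a power of
// two and shared memory size = blockDim.x * sizeof(float).
__global__ void reduceFloat(const float *g_in, float *g_out, unsigned int n)
{
    // Shared-memory declaration changed from int to float.
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against reading past the end of the input (non-power-of-two n).
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; the per-block partial
    // sums are then reduced again (or summed on the host).
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}
```

This only pads out-of-range threads with 0.0f; for best performance you’d still want the later optimizations from the sample (sequential addressing, multiple elements per thread, etc.).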

Of course, but in my case all input sizes are powers of two…

A quick and easy way to do this is to use cublasSgemm to do a matrix multiply with a ones vector (a vector whose elements are all 1.0f) of the same length as your data. You’ll probably have to write a trivial kernel to initialize your ones vector, but the call to Sgemm is fairly straightforward; just be careful to get the input dimensions correct. My guess is that using CUBLAS would be slower than the reduction example, but it would be interesting to see by how much.

If you really want to use a BLAS call, the right one to use is cublasSdot (you just need a dot product with a vector of ones).
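Something like this, using the (legacy) CUBLAS API; the helper function name is mine, error checking is omitted, and cublasInit() is assumed to have been called already:

```cuda
#include <cublas.h>
#include <stdlib.h>

// Hypothetical helper: sum n device floats by taking the dot product
// of the data with a vector of ones.
float sumWithCublas(const float *d_data, int n)
{
    // Build the ones vector on the host and copy it to the device.
    // (A trivial initialization kernel would avoid the host copy.)
    float *h_ones = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h_ones[i] = 1.0f;

    float *d_ones;
    cublasAlloc(n, sizeof(float), (void **)&d_ones);
    cublasSetVector(n, sizeof(float), h_ones, 1, d_ones, 1);

    // dot(data, ones) == sum of the elements of data.
    float sum = cublasSdot(n, d_data, 1, d_ones, 1);

    cublasFree(d_ones);
    free(h_ones);
    return sum;
}
```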

It is going to be slower than the reduction.

The OP mentioned a data size “around 64K” - I would guess that isn’t a power of two until asherimtiaz says otherwise.