Taking sum of n floating point numbers

asherimtiaz · May 1, 2008, 4:23pm

Hi guys,

Can anyone suggest the most efficient way to sum n floating-point numbers in CUDA?

thanks

-asher

seibert · May 1, 2008, 5:03pm

Can you give an order of magnitude size for n?

DenisR · May 1, 2008, 6:00pm

Well, I can’t think of anything better than the reduction example.

asherimtiaz · May 2, 2008, 1:36pm

n around 64K…

seibert · May 2, 2008, 2:18pm

CUDPP also says that it has parallel reduction code which you can call:
http://www.gpgpu.org/developer/cudpp/

(I can only find parallel prefix scan functions in the documentation, but that is a more general case of parallel reduction, so maybe that is what they mean.)

jimh · May 2, 2008, 2:59pm

I second Denis’s suggestion. Modify the reduction sample.

DenisR · May 2, 2008, 6:20pm

I don’t think any modification is even needed; just use the reduction example code with the right N. The reduction in CUDPP is (or will be), I believe, exactly the same as in the example.

jimh · May 5, 2008, 5:34pm

Except the reduction sample works on integers, not floats.

Devaster · May 5, 2008, 5:37pm

Replace the int input with float - it’s easy …

jimh · May 5, 2008, 5:44pm

Which is why I said to modify the reduction sample.

Modifying the input type isn’t enough, though. You must modify the shmem declaration and you may need to add code to handle non-power-of-two data sizes.
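
To make those two changes concrete, here is a minimal sketch of a float sum reduction. It is not the SDK reduction sample itself: the names sumKernel and sumOnDevice are hypothetical, the per-block partial sums are simply added on the CPU, and error checking is omitted. The shared memory array is declared as float, and the bounds checks are what allow n to be any positive size, such as a value "around 64K" that is not a power of two.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the SDK sample): each block reduces its slice of
// the input to one partial sum, using float shared memory.
__global__ void sumKernel(const float *in, float *partial, unsigned int n)
{
    extern __shared__ float sdata[];   // declared as float, not int

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;

    // Bounds checks: threads past the end of the data contribute 0, so n
    // does not have to be a power of two.
    float v = 0.0f;
    if (i < n) v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper, assuming n > 0. One kernel pass, then the few
// per-block partial sums are added on the CPU. No error checking.
float sumOnDevice(const float *hostData, unsigned int n)
{
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads * 2 - 1) / (threads * 2);

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, hostData, n * sizeof(float), cudaMemcpyHostToDevice);

    sumKernel<<<blocks, threads, threads * sizeof(float)>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With 256 threads per block, each block covers 512 elements, so 64K inputs leave only about 128 partial sums for the final CPU loop. Note that the summation order differs from a sequential CPU loop, so float results can differ slightly in the last digits.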

Devaster · May 5, 2008, 6:08pm

Of course, but in my case all input sizes are powers of two …

mattb3 · May 6, 2008, 1:31am

A quick and easy way to do this is to use cublasSgemm to do a matrix multiply with a ones vector (a vector whose elements are all 1.0f) of the same length as your data. You’ll probably have to write a trivial kernel to initialize your ones vector, but the call to Sgemm is fairly straightforward; just be careful to get the input dimensions correct. My guess is that using cublas would be slower than the reduction example, but it would be interesting to see by how much.

mfatica · May 6, 2008, 2:19am

If you really want to use a BLAS call, the right one to use is cublasSdot (you just need a dot product with a vector of ones).
It is going to be slower than the reduction.
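
As a rough sketch of the dot-product approach: the sum of x equals the dot product of x with a vector of ones. The code below assumes the handle-based cuBLAS API from cublas_v2.h, which is newer than the interface available when this thread was written. The function name sumWithSdot is hypothetical, the ones vector is copied from the host rather than filled by a kernel, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper, assuming n > 0: sums n floats by taking the dot
// product of the data with a vector of ones. No error checking.
float sumWithSdot(const float *hostData, int n)
{
    // Build the ones vector on the host for simplicity; a trivial kernel
    // could fill it on the device instead.
    std::vector<float> ones(n, 1.0f);

    float *dX = nullptr, *dOnes = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dOnes, n * sizeof(float));
    cudaMemcpy(dX, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dOnes, ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // With the default (host) pointer mode, the result is written to a
    // host variable and the call blocks until it is available.
    float result = 0.0f;
    cublasSdot(handle, n, dX, 1, dOnes, 1, &result);

    cublasDestroy(handle);
    cudaFree(dX);
    cudaFree(dOnes);
    return result;
}
```

This reads two vectors from memory and multiplies every element by 1.0f, which is extra work compared with a dedicated reduction kernel.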

jimh · May 6, 2008, 7:22pm

The OP mentioned a data size “around 64K” - I would guess that isn’t a power of two until asherimtiaz says otherwise.
