Let me give my rough take on this problem. This may or may not be efficient as I have not even coded it. I am designing it inside this text box.
SUM_KERNEL(float *g_in, float *g_out)
Divide the input array into “N” distinct subsets, where N is the number of blocks that
you are going to spawn.
Each block then has to operate on roughly “TOTAL/N” data items; let us call this number “M”.
g_in + (M*blockIdx.x) == SA (Start Address) for any given block
local_sum = 0;
for(i=threadIdx.x; i<M; i+=blockDim.x)
local_sum += SA[i];
Note that all the global memory fetches are “coalesced”.
At this stage, each thread of a BLOCK holds a partial sum for THAT block.
g_out[blockIdx.x*blockDim.x + threadIdx.x] = local_sum;
Note that this requires “g_out” to have as many entries as there are threads in the entire grid.
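To make the idea concrete, here is the above as an actual kernel. This is just my untested sketch; I pass M in as a parameter, and I assume M divides the input size evenly (the last block would need a bounds check otherwise):

```cuda
__global__ void SUM_KERNEL(const float *g_in, float *g_out, int M)
{
    // SA: start address of this block's slice of the input.
    const float *SA = g_in + (size_t)M * blockIdx.x;

    // Each thread strides through the slice by blockDim.x, so
    // neighbouring threads always read neighbouring addresses
    // and every global fetch is coalesced.
    float local_sum = 0.0f;
    for (int i = threadIdx.x; i < M; i += blockDim.x)
        local_sum += SA[i];

    // One partial sum per thread in the whole grid. No
    // __syncthreads() and no shared memory needed.
    g_out[blockIdx.x * blockDim.x + threadIdx.x] = local_sum;
}
```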
Now, it is just a question of calling “SUM_KERNEL” repeatedly with an appropriately shrinking number of blocks.
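The repeated invocation might look something like this on the host side. Again, an untested sketch; “d_a”, “d_b” and “total” are hypothetical names, the kernel takes M as a third argument, and I assume “total” is a multiple of each pass's block count:

```cuda
// Ping-pong between two pre-allocated device buffers, shrinking
// the data until the remainder fits comfortably on the CPU.
int n = total;                 // current element count
float *src = d_a, *dst = d_b;
while (n > 128 * 32) {
    int blocks = 128, threads = 32;
    SUM_KERNEL<<<blocks, threads>>>(src, dst, n / blocks);
    n = blocks * threads;      // each pass emits one value per thread
    float *tmp = src; src = dst; dst = tmp;  // swap buffers
}
// Finally, copy the <= 4096 partial sums in "src" back to the
// host with cudaMemcpy and add them up on the CPU.
```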
Also note that there is no synchronisation required between threads of a block.
Also, the shared memory usage is minimal; in fact, this version uses no shared memory at all.
So, depending on your hardware, invoke the kernel with appropriate block and grid dimensions.
For example, on my 8800 GTX, I would call this with “32 threads per block” and 128 blocks (16 multiprocessors * 8 max_active_blocks = 128 blocks).
That would mean that, regardless of the input array size, I can reduce it to 4096 (128*32) partial sums.
Now, for input array sizes on the order of 1 million, this 4096 is a very puny number.
Now, it's up to the programmer to decide on optimal block and grid sizes and squeeze out performance.