Parallel addition

I am fairly new to GPU computing and I am looking at summing values on the GPU. I have n arrays, each of which has to be summed. Is there a way to assign, say, x threads to each array and do a parallel reduce? How would I choose the blocks and the threads per block in this case?

You can find out everything you want to know in the reduction example in the CUDA SDK.

But to my understanding, the reduction example in the SDK takes a single array of elements and reduces it. That works for a single large array, but I don't see how to modify the reduction code to handle sums over n smaller arrays. Would calling the kernel repeatedly, once per array, help?

Sorry, missed that bit ;)
One option is to use a 2D grid: one dimension is used like in the example, and the other dimension is the index of the array being reduced (1…n). That way you can reduce all n arrays in a single kernel call.
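For example, here is a minimal sketch of that idea, assuming the n arrays all have the same length len and are packed back-to-back in a single device buffer (names like multiReduce and partialSums are just for illustration, not from the SDK example; blockDim.x must be a power of 2 here):

__global__ void multiReduce(const float *arrays, float *partialSums, int len)
{
    extern __shared__ float sdata[];          // blockDim.x floats, sized at launch

    const int arrayIdx = blockIdx.y;                    // which of the n arrays
    const int tid      = threadIdx.x;
    const int i        = blockIdx.x * blockDim.x + tid; // element within that array

    // Load one element per thread (0 past the end of the array).
    sdata[tid] = (i < len) ? arrays[arrayIdx * len + i] : 0.0f;
    __syncthreads();

    // Standard shared-memory tree reduction.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; each array still has gridDim.x partials
    // left to add up in a second pass (or on the host).
    if (tid == 0)
        partialSums[arrayIdx * gridDim.x + blockIdx.x] = sdata[0];
}

You'd launch it with something like dim3 grid((len + 255) / 256, n); multiReduce<<<grid, 256, 256 * sizeof(float)>>>(devArrays, devPartialSums, len); and then sum the gridDim.x partial results per array, either on the host or with one more kernel pass.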

Or, to get started, you could try adapting this very simple code (element-wise sum of 2 arrays):

__global__ void GPUSum(const float *array1,
                       const float *array2,
                       float       *arrayRes,
                       const int    Elements)
{
    // One thread per element; __umul24 is a fast 24-bit integer multiply.
    unsigned int index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // Guard against the threads of the last block that fall past the end.
    if (index < Elements)
        arrayRes[index] = array1[index] + array2[index];
}

You’ll need 1 thread per element.

So in main() you’ll do:

#define THREADSxBLOCK 256

[…]

iArraySize = iRows * iCols;

dim3 dimBlock(THREADSxBLOCK, 1, 1);
// Round up so the last, partially filled block is still launched.
dim3 dimGrid((iArraySize + THREADSxBLOCK - 1) / THREADSxBLOCK, 1, 1);

GPUSum<<<dimGrid, dimBlock>>>(devArray1, devArray2, devArrayRes, iArraySize);

[…]

I’ve omitted all the malloc and host<->GPU transfer stuff of course.
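(If it helps, a rough sketch of what that omitted host-side part could look like follows; hostArray1, hostArray2, and hostArrayRes are hypothetical host buffers, and error checking is left out for brevity.)

// Hypothetical host-side setup for the call above; error checks omitted.
float *devArray1, *devArray2, *devArrayRes;
size_t bytes = iArraySize * sizeof(float);

cudaMalloc((void **)&devArray1,   bytes);
cudaMalloc((void **)&devArray2,   bytes);
cudaMalloc((void **)&devArrayRes, bytes);

cudaMemcpy(devArray1, hostArray1, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(devArray2, hostArray2, bytes, cudaMemcpyHostToDevice);

GPUSum<<<dimGrid, dimBlock>>>(devArray1, devArray2, devArrayRes, iArraySize);

cudaMemcpy(hostArrayRes, devArrayRes, bytes, cudaMemcpyDeviceToHost);

cudaFree(devArray1);
cudaFree(devArray2);
cudaFree(devArrayRes);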

Fer