Parallel addition

I am fairly new to GPU computing and I am looking at summing values on the GPU. I have n arrays whose sums have to be found. Is there a way I can designate, say, x threads to each array and do a parallel reduce? How would I choose the blocks and the threads per block in this case?

You can find out everything you want to know from the reduction example in the SDK.

But to my understanding, the reduction example in the SDK takes a single array of elements and reduces it. That works for a single large array, but I did not see how to modify the reduction code to handle sums over n smaller arrays. Would calling the kernel repeatedly help?

Sorry, missed that bit ;)
One option is to use a 2D grid: one dimension is used as in the example, and the other is the index of the array to reduce (1…n). That way you can reduce all the arrays in one kernel call.
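To make the idea concrete, here is a minimal sketch of a per-array reduction kernel. It uses a simpler variant of the suggestion above: one block per array instead of a 2D grid, which is enough as long as one block's grid-stride loop can cover each array. The names (reduceMany, d_in, d_out) and the contiguous layout (array i occupying d_in[i*n .. i*n + n - 1]) are assumptions for illustration, and blockDim.x is assumed to be a power of 2:

```cuda
// Sketch: one block per array; blockDim.x threads cooperate on a
// shared-memory tree reduction. blockIdx.x selects the array.
__global__ void reduceMany (const float *d_in, float *d_out, int n)
{
	extern __shared__ float sdata[];

	const float *myArray = d_in + blockIdx.x * n;	// this block's array
	unsigned int tid = threadIdx.x;

	// grid-stride load: each thread accumulates a partial sum
	float sum = 0.0f;
	for (unsigned int i = tid; i < n; i += blockDim.x)
		sum += myArray[i];
	sdata[tid] = sum;
	__syncthreads();

	// tree reduction in shared memory (blockDim.x must be a power of 2)
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
	{
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	if (tid == 0)
		d_out[blockIdx.x] = sdata[0];	// one sum per array
}
```

You would launch it with one block per array and the shared memory sized to the block, e.g. `reduceMany<<<numArrays, 256, 256 * sizeof(float)>>>(d_in, d_out, n);`. For arrays too large for a single block, you would go back to the 2D-grid version (blockIdx.y = array index, blockIdx.x along the array) and reduce the per-block partial sums in a second pass.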

You could try adapting this very simple code (element-wise sum of 2 arrays):

__global__ void GPUSum (const float *array1,
						const float *array2,
						float *arrayRes,
						const int iSize)
{
	unsigned int index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

	if (index < iSize)
		arrayRes[index] = array1[index] + array2[index];
}
You’ll need 1 thread per element (1 per pixel, in my case).

So in the main() you’ll do:

#define THREADSxBLOCK  256

iArraySize = iRows * iCols;

dim3 dimBlock (THREADSxBLOCK, 1, 1);
// round up so a partially filled last block still covers every element
dim3 dimGrid ((iArraySize + THREADSxBLOCK - 1) / THREADSxBLOCK, 1, 1);

// d_array1, d_array2, d_arrayRes are the device pointers
GPUSum<<<dimGrid, dimBlock>>> (d_array1, d_array2, d_arrayRes, iArraySize);

I’ve omitted all the malloc and host<->GPU transfer stuff of course.
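For completeness, a sketch of what that omitted setup might look like. The host buffer names (h_array1, h_array2, h_arrayRes) are hypothetical, and error checking is left out for brevity:

```cuda
size_t bytes = iArraySize * sizeof(float);
float *d_array1, *d_array2, *d_arrayRes;

// allocate device memory
cudaMalloc((void **)&d_array1,  bytes);
cudaMalloc((void **)&d_array2,  bytes);
cudaMalloc((void **)&d_arrayRes, bytes);

// copy the inputs to the GPU
cudaMemcpy(d_array1, h_array1, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_array2, h_array2, bytes, cudaMemcpyHostToDevice);

GPUSum<<<dimGrid, dimBlock>>> (d_array1, d_array2, d_arrayRes, iArraySize);

// copy the result back and clean up
cudaMemcpy(h_arrayRes, d_arrayRes, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_array1);
cudaFree(d_array2);
cudaFree(d_arrayRes);
```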