Simple Inefficient Parallel Addition


I am attempting to learn CUDA, and I have experience with OpenMP, MPI, and pthreads. I want to try to implement a naive summation as a parallel reduction. I realize it is not efficient, but it will help me be sure I am learning CUDA correctly, and I can't find any similar examples. So here is the kernel, which does not give me the correct overall sum:

__global__ void sumArray(float *input_cu, float *sum_cu, int blockSize, int numPoints, int numThreads)
{
    // control variables
    int pid = threadIdx.x;
    int startIndex = (pid * blockSize);
    int stopIndex = (startIndex + blockSize - 1);
    if(pid == (numThreads - 1))
        stopIndex = (numPoints - 1);

    // overall sums, MUST put in shared, use extern to defer sizing upon declaration
    extern __shared__ float sums[];

    // find local sum
    float localSum = 0.0;
    for(int lcv = startIndex; lcv <= stopIndex; lcv++)
        localSum = localSum + input_cu[lcv];

    // update overall sum array
    sums[pid] = localSum;

    // update global sum (KDE_N pointer)
    if(pid == 0)
    {
        *sum_cu = 0;
        for(int lcv = 0; lcv < numThreads; lcv++)
            *sum_cu = *sum_cu + sums[lcv];
    }
}


And below is the context it is called from. I have an 8600GT NVIDIA card, and the data set is roughly 16 million float values. Calling context:

cudaMemcpy(data_cu, data, numPoints * sizeof(float), cudaMemcpyHostToDevice);
float KDE_N = 0.0;
float *KDE_N_cu;
cudaMalloc((void **)&KDE_N_cu, sizeof(float));
sumArray<<<1,512>>>(data_cu, KDE_N_cu, (numPoints / 512), numPoints, 512);
cudaMemcpy(&KDE_N, KDE_N_cu, sizeof(float), cudaMemcpyDeviceToHost);
printf("N:\t%3.0f\n", KDE_N);

float KDE_N_test = 0.0;
for(int lcv = 0; lcv < numPoints; lcv++)
    KDE_N_test += data[lcv];
printf("N_chk:\t%3.0f\n", KDE_N_test);

Thanks for your help!

Take a look at the ‘reduction’ sample in the SDK, and the document that goes along with it. That implements several progressively more optimized kernels that do a parallel summation.
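For context, the core idea behind those SDK kernels can be boiled down to something like the following sketch (the names here are illustrative, not the SDK's actual code). Each thread first accumulates a strided slice of the input into shared memory, then the block combines the partial sums in a pairwise tree; this version assumes a single block and a power-of-two blockDim.x:

```cuda
// Minimal single-block tree reduction, in the spirit of the SDK sample.
// Assumes blockDim.x is a power of two.
__global__ void blockReduce(float *input, float *output, int n)
{
    extern __shared__ float partial[];

    int tid = threadIdx.x;

    // each thread sums a strided slice of the input
    float acc = 0.0f;
    for (int i = tid; i < n; i += blockDim.x)
        acc += input[i];
    partial[tid] = acc;
    __syncthreads();

    // pairwise tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // thread 0 holds the block's total
    if (tid == 0)
        *output = partial[0];
}
```

It would be launched with the shared-memory size as the third configuration parameter, e.g. `blockReduce<<<1, 512, 512 * sizeof(float)>>>(data_cu, KDE_N_cu, numPoints);`.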

Yes, I saw those, and they look good. But I want to know why this is not working just for a basic understanding.

I’m not sure if this is the cause of your problem, but you do have a memory bug.

When you declare your shared array as extern:

extern __shared__ float sums[];

you also must specify the number of bytes of shared memory when you call the kernel (optional 3rd parameter in the <<<>>>).

You need something like:

sumArray<<<1,512,sizeof(float)*512>>>(data_cu, KDE_N_cu, (numPoints / 512), numPoints, 512);

This tells the CUDA driver to give each block enough shared memory for the sums array to hold the partial sum from each of your 512 threads. Your current code allocates no shared memory for the array, which means you are writing to unreserved shared memory. Since you only have 1 block, this probably isn't causing the incorrect sum, but it could be a problem if you had more blocks.

I'm with seibert. Other than that I could not spot anything logically wrong with it… but hey, it's 12:14am.

There is something you absolutely must not do though, performance-wise:

*sum_cu = *sum_cu + sums[lcv];

This reads and writes global memory on every iteration of the loop. Accumulate in a thread-local variable and write to global memory only once.
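Concretely, the tail of the posted kernel could be reworked along these lines (a sketch, not a tested drop-in fix). Note the __syncthreads() call, which the posted kernel lacks: without a barrier, thread 0 may read sums[] entries before the other threads have stored their partial sums.

```cuda
// write this thread's partial sum, then wait for every thread in the block
sums[pid] = localSum;
__syncthreads();

if (pid == 0)
{
    float total = 0.0f;                       // register accumulator
    for (int lcv = 0; lcv < numThreads; lcv++)
        total += sums[lcv];
    *sum_cu = total;                          // single global-memory write
}
```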

I think that is my problem.

I tried making this memory declaration (just as a sanity check):

__shared__ float sums[1000];

And it worked correctly. So you are right that the memory was not set up properly.

Thanks for your help!