Simple Inefficient Parallel Addition

Hello:

I am attempting to learn CUDA, and I have experience with OpenMP, MPI, and pthreads. I want to try to implement a naive summation as a parallel reduction. I realize it is not efficient, but it will help me be sure I am learning CUDA correctly, and I can’t find any similar examples. So here is the kernel that does not give me the correct overall sum:

__global__ void sumArray(float *input_cu, float *sum_cu, int blockSize, int numPoints, int numThreads)
{
    // control variables
    int pid = threadIdx.x;
    int startIndex = (pid * blockSize);
    int stopIndex = (startIndex + blockSize - 1);
    if(pid == (numThreads - 1))
    {
        stopIndex = (numPoints - 1);
    }

    // overall sums, MUST put in shared, use extern to defer sizing upon declaration
    extern __shared__ float sums[];

    // find local sum
    float localSum = 0.0;
    for(int lcv = startIndex; lcv <= stopIndex; lcv++)
    {
        localSum = localSum + input_cu[lcv];
    }

    // update overall sum array
    sums[pid] = localSum;

    // update global sum (KDE_N pointer)
    __syncthreads();
    if(pid == 0)
    {
        *sum_cu = 0;
        for(int lcv = 0; lcv < numThreads; lcv++)
        {
            *sum_cu = *sum_cu + sums[lcv];
        }
    }
}

And below is the context it is called from. I have an 8600GT NVIDIA card, and the data set is roughly 16 million float values. Calling context:

cudaMemcpy(data_cu, data, numPoints * sizeof(float), cudaMemcpyHostToDevice);
float KDE_N = 0.0;
float *KDE_N_cu;
cudaMalloc((void **)&KDE_N_cu, sizeof(float));
sumArray<<<1,512>>>(data_cu, KDE_N_cu, (numPoints / 512), numPoints, 512);
cudaMemcpy(&KDE_N, KDE_N_cu, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(KDE_N_cu);
printf("N:\t%3.0f\n", KDE_N);

float KDE_N_test = 0.0;
for(int lcv = 0; lcv < numPoints; lcv++)
{
KDE_N_test += data[lcv];
}
printf("N_chk:\t%3.0f\n", KDE_N_test);

Thanks for your help!

Take a look at the ‘reduction’ sample in the SDK, and the document that goes along with it. That implements several progressively more optimized kernels that do a parallel summation.
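The core idea in those kernels is a tree reduction in shared memory. Very roughly, something like this untested sketch (not the actual SDK code; the kernel name is just for illustration, it assumes blockDim.x is a power of two, and it must be launched with blockDim.x * sizeof(float) bytes of dynamic shared memory):

__global__ void blockSum(const float *input, float *blockSums, int n)
{
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread loads one element into shared memory (0 if past the end)
    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // tree reduction: the number of active threads halves each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // thread 0 writes this block's partial sum
    if (tid == 0)
    {
        blockSums[blockIdx.x] = sdata[0];
    }
}

The per-block partial sums then get added up on the host, or by running the kernel again over the blockSums array.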

Yes, I saw those, and they look good. But I want to know why this version is not working, just for my own basic understanding.

I’m not sure if this is the cause of your problem, but you do have a memory bug.

When you declare your shared array as extern:

extern __shared__ float sums[];

you also must specify the number of bytes of shared memory when you call the kernel (optional 3rd parameter in the <<<>>>).

You need something like:

sumArray<<<1,512,sizeof(float)*512>>>(data_cu, KDE_N_cu, (numPoints / 512), numPoints, 512);

This tells the CUDA driver to give each block enough dynamic shared memory for the sums array to hold one partial sum from each of your 512 threads. Your current code allocates no shared memory for the sums array, which means you are writing to unreserved shared memory. Since you only have 1 block, this probably isn’t causing the incorrect sum, but it could be a problem if you had more blocks.

I’m with seibert. Other than that I could not spot anything logically wrong with it… but hey, it’s 12:14am.

There is one thing you absolutely must not do, though, performance-wise:

*sum_cu = *sum_cu + sums[lcv];

This reads and writes global memory on every iteration of the loop.
Accumulate in a thread-local variable and write to global memory only once.
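For example, a sketch of just that last loop with a register accumulator (assuming the rest of your kernel stays the same):

if(pid == 0)
{
    float totalSum = 0.0f;   // accumulate in a thread-local variable (register)
    for(int lcv = 0; lcv < numThreads; lcv++)
    {
        totalSum = totalSum + sums[lcv];
    }
    *sum_cu = totalSum;      // single write to global memory
}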

I think that is my problem.

I tried making this memory declaration (just as a sanity check):

__shared__ float sums[1000];

And it worked correctly. So you are correct in saying the memory was not set up right.

Thanks for your help!