Parallel reduction problem

Hi,

I’ve been working on a parallel reduction, following the code sample in the SDK. The program is simple: it just calculates the sum of all the numbers in an array. But when I compared the result with what the CPU calculates, the results were very different. I checked and noticed that somehow only the first 8 threads in a block add up. If I change the block size to 8 instead of 256, which is what I’m currently using, it does give the correct result. Some of the code is listed below. Thanks in advance for any help.

__global__ void add(int *g_idata, int *g_odata)
{
extern __shared__ int sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();

// do reduction in shared mem
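// each step, the first s threads add in the element s positions
// away; s halves each iteration until the block sum is in sdata[0]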
for(unsigned int s=blockDim.x/2; s>0; s>>=1) 
{
	if (tid < s) 
	{
		sdata[tid] += sdata[tid + s];
	}
	__syncthreads();
}

// write result for this block to global mem
if (tid == 0) 
	g_odata[blockIdx.x] = sdata[0];

}

In main():
// pointer for host memory
int *h_blockResult;
int h_sum=0;
int *h_mean;

// pointer for device memory
int *d_largeArray;
int *d_blockResult;
int *d_mean;

// define grid and block size
const int numBlocks = 1024;
const int numThreadsPerBlock = 256;

// allocate memory on the CPU side
h_blockResult=(int*)malloc(numBlocks*sizeof(int));
h_mean=(int*)malloc(1*sizeof(int));

// allocate memory on the GPU side
cudaMalloc( (void **) &d_largeArray, arrayNum*sizeof(int));
cudaMalloc( (void **) &d_blockResult, numBlocks*sizeof(int));
cudaMalloc( (void **) &d_mean, 1*sizeof(int));

// copy largeArray from CPU to GPU
cudaMemcpy(d_largeArray, largeArray, arrayNum*sizeof(int), cudaMemcpyHostToDevice);

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
add<<< dimGrid, dimBlock >>>(d_largeArray, d_blockResult);
checkCUDAError("kernel error");

// copy blockResult from GPU to CPU
cudaMemcpy(h_blockResult, d_blockResult, numBlocks*sizeof(int), cudaMemcpyDeviceToHost);

// check for any CUDA errors
checkCUDAError("memcpy error");

// sum up all the numbers in each block
for (int i=0; i<numBlocks; i++)
{
h_sum+=h_blockResult[i];
}
printf("GPU calculated sum is %d\n", h_sum);

Hi, I’m not sure whether this is the problem, but you need to specify the shared memory size in the execution configuration. Since you declare sdata as extern __shared__, it gets its size from the third launch parameter, which defaults to 0 if you leave it out; it should be block size * sizeof(int). I also recommend calling cudaThreadSynchronize() and checking cudaGetLastError() after the kernel call, since the launch itself returns before the kernel finishes.
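
Something like this (an untested sketch, reusing the names from your post):

// third execution-configuration argument = dynamic shared memory bytes per block
add<<< dimGrid, dimBlock, numThreadsPerBlock*sizeof(int) >>>(d_largeArray, d_blockResult);

// wait for the kernel to finish, then check for errors
cudaThreadSynchronize();
cudaError_t rv1 = cudaGetLastError();
if (rv1 != cudaSuccess)
	printf("kernel error: %s\n", cudaGetErrorString(rv1));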

Ken D.