Parallel reduction problem


I’ve been implementing parallel reduction following the code sample in the SDK. The program is simple: it just calculates the sum of all the numbers in an array. But when I compared the result with what the CPU calculates, the results were very different. I then checked and noticed that somehow only the first 8 threads in each block were being added up. If I change the block size to 8 instead of 256, which is what I’m currently using, it does give the correct result. Some of the code is listed below. Thanks in advance for any help.

__global__ void add(int *g_idata, int *g_odata)
{
	// dynamically sized shared array, one int per thread in the block
	extern __shared__ int sdata[];

	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
	__syncthreads();	// wait until the whole block has loaded

	// do reduction in shared mem
	for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();	// finish this step before halving s
	}

	// write result for this block to global mem
	if (tid == 0)
		g_odata[blockIdx.x] = sdata[0];
}


In the main:
// pointer for host memory
int *h_blockResult;
int h_sum=0;
int *h_mean;

// pointer for device memory
int *d_largeArray;
int *d_blockResult;
int *d_mean;

// define grid and block size
const int numBlocks = 1024;
const int numThreadsPerBlock = 256;

// allocate memory on the CPU side

// allocate memory on the GPU side
cudaMalloc( (void **) &d_largeArray, arrayNum*sizeof(int));
cudaMalloc( (void **) &d_blockResult, numBlocks*sizeof(int));
cudaMalloc( (void **) &d_mean, 1*sizeof(int));

// copy largeArray from CPU to GPU
cudaMemcpy(d_largeArray, largeArray, arrayNum*sizeof(int), cudaMemcpyHostToDevice);

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
add<<< dimGrid, dimBlock >>>(d_largeArray, d_blockResult);
checkCUDAError("kernel error");

// copy blockResult from GPU to CPU
cudaMemcpy(h_blockResult, d_blockResult, numBlocks*sizeof(int), cudaMemcpyDeviceToHost);

// check for any CUDA errors
checkCUDAError("memcpy error");

// sum up all the numbers in each block
for (int i=0; i<numBlocks; i++)
	h_sum += h_blockResult[i];
printf("GPU calculated sum is %d\n", h_sum);

Hi, I’m not sure what the problem is, but you need to specify the shared memory size in the execution configuration. It should be as large as the block size * sizeof(int). Also, I recommend “cudaThreadSynchronize(); int rv1 = cudaGetLastError(); if (rv1) …” after the kernel call.
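
A minimal sketch of both suggestions, not a definitive fix. It assumes the input holds exactly numBlocks * numThreadsPerBlock ints and that the block size is a power of two. The helper name launchAdd is hypothetical, and cudaDeviceSynchronize() is swapped in for cudaThreadSynchronize(), which newer toolkits mark as deprecated. The third launch parameter is the number of bytes of dynamic shared memory that backs the extern sdata array:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same reduction as in the question: each block writes one partial sum.
__global__ void add(const int *g_idata, int *g_odata)
{
    // Sized at launch time by the third <<<...>>> parameter.
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical helper (not from the original post): launches add() with
// enough dynamic shared memory and reports any CUDA error.
// Returns true on success.
bool launchAdd(const int *d_in, int *d_out,
               int numBlocks, int numThreadsPerBlock)
{
    // One int of shared memory per thread in the block.
    size_t smemBytes = numThreadsPerBlock * sizeof(int);
    add<<<numBlocks, numThreadsPerBlock, smemBytes>>>(d_in, d_out);

    // Launch-configuration errors show up here.
    cudaError_t err = cudaGetLastError();
    // Errors raised while the kernel runs show up after synchronizing.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    if (err != cudaSuccess) {
        fprintf(stderr, "add kernel failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With the variables from the question, the call would look like launchAdd(d_largeArray, d_blockResult, numBlocks, numThreadsPerBlock), followed by the same cudaMemcpy back to h_blockResult.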

Ken D.