Problem with reduction

Hey Everyone,

I’m trying to implement the reduction code from the SDK examples in one of my pieces of code, but I’m having some trouble. My test code just fills array, a, with sequential numbers. When the array size grows somewhere past 5000, the result ends up incorrect. Below is the code that I’m using:

#define N 6000

#define nTU	1024

#define nBU 6

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void SUM(int *a, int *b, int *c){

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	int tid = threadIdx.x;

	int x;

	__shared__ int sdata[nTU];

		sdata[threadIdx.x] = a[i];

	__syncthreads();

	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 

    {

        if (tid < s)

        {

            sdata[tid] += sdata[tid + s];

        }

        __syncthreads();

    }

	if(threadIdx.x==0){

		b[blockIdx.x] = sdata[0];

	}

	if(threadIdx.x==0){

		__threadfence();

		unsigned int value = atomicInc(&count, gridDim.x);

		isLastBlockDone = (value == (gridDim.x - 1));	

	}

	__syncthreads();

	if(isLastBlockDone){

		if(threadIdx.x==0){

			for(x=0;x<nBU;x++)

				c[0] += b[x];

			count = 0;

		}

	}

	__syncthreads();

}

[code]

This is how I am calling the kernel.

[code]

SUM<<<nBU, nTU>>>(d_A, d_B, d_C);

When the array size is equal to 6000, the results are as follows:

CPU: 17997000

GPU: 18005128

Thanks for any help!

How do you ensure you do not sum past the end of the array - do you round up the array size to the next multiple of the blocksize and zero the unused part?

In the host code, I only fill the array up to size N. I’m not sure why I didn’t run into this until I hit 5120. I just tried setting the array size to the 6144 (6*1024), and that comes out with the correct result. So you’re question is my answer!

Is there maybe a better way to ensure that I don’t sum past the end of the array? Is there a significant performance hit for simply zeroing the rounded up array size?

FYI - thanks, Tera for your response. The code works after rounding the array up and using memset to zero out the data. Much appreciated.

The other option would of course be to add a [font=“Courier New”]size[/font] argument to the kernel and to check inside the kernel that all accesses stay within the array size. As the kernel is bandwidth limited anyway, the check comes free of cost.