Hey Everyone,
I’m trying to use the reduction code from the SDK examples in one of my programs, but I’m having some trouble. My test code just fills the array a with sequential numbers. Once the array size grows somewhere past 5000, the result comes out wrong. Below is the code that I’m using:
[code]
#define N 6000
#define nTU 1024
#define nBU 6

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void SUM(int *a, int *b, int *c){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    int x;
    __shared__ int sdata[nTU];

    sdata[threadIdx.x] = a[i];
    __syncthreads();

    for(unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if(threadIdx.x==0){
        b[blockIdx.x] = sdata[0];
    }

    if(threadIdx.x==0){
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();

    if(isLastBlockDone){
        if(threadIdx.x==0){
            for(x=0;x<nBU;x++)
                c[0] += b[x];
            count = 0;
        }
    }
    __syncthreads();
}
[/code]
This is how I am calling the kernel.
[code]
SUM<<<nBU, nTU>>>(d_A, d_B, d_C);
[/code]
When the array size is equal to 6000, the results are as follows:
CPU: 17997000
GPU: 18005128
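One possible cause worth checking: the launch above starts 6 x 1024 = 6144 threads, but the array only holds 6000 elements, so threads 6000 through 6143 execute `a[i]` past the end of `a` and may add whatever happens to be in that memory. The sketch below shows a guarded variant under some assumptions that are not in the code above: the kernel takes an extra length parameter `n`, the names `sumGuarded`, `sumOnGpu` and `blocksDone` are made up for illustration, `b` is declared `volatile` so the last block re-reads the partial sums from memory, and the result is assigned to `c[0]` rather than accumulated so that `c` does not need to be zeroed first. Error checking is left out for brevity, and it assumes a non-empty input.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_THREADS 1024  // threads per block, same value as nTU in the post

// Hypothetical name; plays the same role as `count` in the post.
__device__ unsigned int blocksDone = 0;

// Guarded variant of the kernel: `n` (the array length) is an added parameter.
__global__ void sumGuarded(const int *a, volatile int *b, int *c, unsigned int n) {
    __shared__ int sdata[N_THREADS];
    __shared__ bool isLastBlockDone;

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array contribute 0 instead of reading a[i].
    sdata[tid] = (i < n) ? a[i] : 0;
    __syncthreads();

    // Same tree reduction as in the post (blockDim.x must be a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        b[blockIdx.x] = sdata[0];
        __threadfence();  // make the partial sum visible before signalling
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to finish adds up the per-block partial sums.
    if (isLastBlockDone && tid == 0) {
        int total = 0;
        for (unsigned int x = 0; x < gridDim.x; ++x) {
            total += b[x];
        }
        c[0] = total;    // assign rather than accumulate
        blocksDone = 0;  // reset the counter for the next launch
    }
}

// Hypothetical host helper: copies the input over, launches, returns the sum.
int sumOnGpu(const std::vector<int> &h_a) {
    unsigned int n = static_cast<unsigned int>(h_a.size());
    unsigned int blocks = (n + N_THREADS - 1) / N_THREADS;  // 6 for n = 6000
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    int result = 0;

    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, blocks * sizeof(int));
    cudaMalloc(&d_c, sizeof(int));
    cudaMemcpy(d_a, h_a.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sumGuarded<<<blocks, N_THREADS>>>(d_a, d_b, d_c, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return result;
}
```

Called with a vector holding 0 through 5999, `sumOnGpu` should return the same 17997000 that the CPU reports. If the original kernel signature has to stay as it is, padding `a` with zeros up to 6144 elements on the host would be another way to test the same idea.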
Thanks for any help!