hi Folks:
I am doing an example about reduction SDK. Basically, I want to sum all entries of an vector together. So I input a vector of size 78400 X 1, with all entries being 1. So the sum should be 78400. I figure reduction can do that, so I tried, and following is my kerbel:
#define blockSize 128
#include <stdio.h>
__device__ inline void atomicAdd(double *address, double value)
{
unsigned long long oldval, newval, readback;
oldval = __double_as_longlong(*address);
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval)
{
oldval = readback;
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
}
}
//template <unsigned int blockSize>
__global__ void myreduce
( double *g_odata,
double *g_idata,
unsigned int n
)
{
__shared__ double sdata[blockSize];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n)
{
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32)
{
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
//if (tid == 0) g_odata[blockIdx.x] = sdata[0];
if (tid == 0) atomicAdd(&g_odata[0], sdata[0]);
}
The g_idata is a 78400 X 1 vector with all entries being 1, g_odata is initiated as a 5 X 1 vector all all entries being 0; n is the length of the input, which is 78400; After getting the sum and saving it in sdata[0], I used atomicAdd() to put into g_odata[0]; If everything goes right, the g_odata should be [78400 0 0 0 0]'.
However, instead of getting 78400, I kept getting 8575, which does not make much sense to me. I am not sure what is the cause of the problem, perhaps the block size, which I think I have monitored and I am sure it is 128;
Any pointer/hint is highly appreciated. Thanks a lot.