a short reduction example cannot get what is expected

hi Folks:

I am doing an example about reduction SDK. Basically, I want to sum all entries of an vector together. So I input a vector of size 78400 X 1, with all entries being 1. So the sum should be 78400. I figure reduction can do that, so I tried, and following is my kerbel:

#define blockSize 128

#include <stdio.h>

__device__ inline void atomicAdd(double *address, double value) 

{

    unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

    newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) 

	{

        oldval = readback;

        newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    }

}

//template <unsigned int blockSize>

__global__ void myreduce

(	double *g_odata, 

	double *g_idata, 

	unsigned int n

) 

{

	__shared__ double sdata[blockSize];

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*(blockSize*2) + tid;

	unsigned int gridSize = blockSize*2*gridDim.x;

	sdata[tid] = 0;

	while (i < n)

	{

		sdata[tid] += g_idata[i] + g_idata[i+blockSize]; 

		i += gridSize; 

	}

	__syncthreads();

	if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

	if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

	if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }

	if (tid < 32)

	{	

		if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

		if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

		if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

		if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];

		if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];

		if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];

	}

	//if (tid == 0) g_odata[blockIdx.x] = sdata[0];

	if (tid == 0) atomicAdd(&g_odata[0], sdata[0]);

}

The g_idata is a 78400 X 1 vector with all entries being 1, g_odata is initiated as a 5 X 1 vector all all entries being 0; n is the length of the input, which is 78400; After getting the sum and saving it in sdata[0], I used atomicAdd() to put into g_odata[0]; If everything goes right, the g_odata should be [78400 0 0 0 0]'.

However, instead of getting 78400, I kept getting 8575, which does not make much sense to me. I am not sure what is the cause of the problem, perhaps the block size, which I think I have monitored and I am sure it is 128;

Any pointer/hint is highly appreciated. Thanks a lot.

If you omit the __syncthreads() once you are within a single warp, you need to declare [font=“Courier New”]sdata[/font] as volatile. This will particularly bite you on compute capability 2.x and higher.

yes, that works. Could you please explain a little bit more on that? What makes the difference? A comparison between the old and the new case? Thanks alot.

Cool.