Simple (honest!) change to parallel reduction example yields bizarre result?

Hello resident gurus,

I am trying to understand and modify the parallel reduction example. As I understand it, the only place the original data is referenced is where each element is accumulated into the per-thread sum that later gets copied into shared memory, so I wanted to change that to generate a simple count. Here’s my modification of the reduce6 kernel code:

    // we reduce multiple elements per thread.  The number is determined by the
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += 1; // g_idata[i]; -- modified

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += 1; // g_idata[i+blockSize]; -- modified

        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;



So I’m just putting sums of 1’s into shared memory instead of sums of the elements themselves (in my real version I’ll be counting conditional flags), and relying on the existing reduction to add them up. Unfortunately, in practice the 16M data points reduce to a total of 64, which is exactly the number of blocks. This is on Ubuntu 10.10 with a GTX 470. What’s going on?



Here’s another simple example of a reduction sum. The main kernel:

template<int els_per_block, int threads>
__global__ void reduce(float* in, float* out)
{
	__shared__ float smem[64];
	int tid = threadIdx.x + blockIdx.x*els_per_block;
	float sum = 0.0f;
	const int iters = els_per_block/threads;

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		sum += in[tid + i*threads];
	}

	if(threads == 32) smem[threadIdx.x+32] = 0.0f;
	smem[threadIdx.x] = sum;
	__syncthreads(); // needed when threads > 32 so all partial sums are visible to warp 0

	if(threadIdx.x < 32) warp_reduce(smem);

	if(threadIdx.x == 0)
		out[blockIdx.x] = smem[threadIdx.x]; // out[0] == ans
}

A full runnable example is attached (6.16 KB).