Hello resident gurus,
I'm trying to understand and modify the parallel reduction example. As I understand it, the only place the kernel touches the original data is while each thread accumulates its local sum, which it then copies into shared memory, so I wanted to change that step to produce a simple count instead. Here's my modification of the reduce6 kernel code:
…
// we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridDim). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
    mySum += 1;            // was: mySum += g_idata[i];            -- modified
    // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
    if (nIsPow2 || i + blockSize < n)
        mySum += 1;        // was: mySum += g_idata[i+blockSize];  -- modified
    i += gridSize;
}

// each thread puts its local sum into shared memory
sdata[tid] = mySum;
__syncthreads();
...
So instead of sums of the elements themselves, I'm just accumulating sums of 1's into shared memory (in my real version I'll be counting conditional flags) and relying on the existing reduction to add them up. I'd expect the final result over 16M data points to be roughly 16M, one count per element, but in practice it comes out as 64, which is exactly the number of blocks in the launch. This is on Ubuntu 10.10 with a GTX 470. What's going on?
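For reference, here is a stripped-down, self-contained sketch of what I think my modification should be doing. It drops the SDK's template parameters and the unrolled warp reduction, and it simply sums the per-block partial counts on the host; the kernel name, launch shape, and host code are just illustrative, not the actual SDK source. I would expect this to print 16M:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative counting kernel: each thread counts 1 per element it visits in a
// grid-stride loop, the block reduces those counts in shared memory, and thread 0
// writes the per-block partial count to g_odata.
// (Simplified stand-in for reduce6; assumes blockDim.x is a power of 2.)
__global__ void countKernel(const float *g_idata, unsigned int *g_odata, unsigned int n)
{
    extern __shared__ unsigned int sdata[];

    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    unsigned int myCount = 0;
    while (i < n)              // grid-stride loop over the input
    {
        myCount += 1;          // count instead of summing g_idata[i]
        i += gridSize;
    }

    sdata[tid] = myCount;      // each thread puts its local count into shared memory
    __syncthreads();

    // simple in-block tree reduction (the SDK version is more heavily optimized)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];   // one partial count per block
}

int main()
{
    const unsigned int n       = 1 << 24;   // 16M elements
    const unsigned int threads = 256;
    const unsigned int blocks  = 64;        // same launch shape I'm using

    float *d_idata;                          // contents never read; we only count
    unsigned int *d_odata;
    cudaMalloc(&d_idata, n * sizeof(float));
    cudaMalloc(&d_odata, blocks * sizeof(unsigned int));

    countKernel<<<blocks, threads, threads * sizeof(unsigned int)>>>(d_idata, d_odata, n);

    // sum the 64 per-block partial counts on the host
    unsigned int h_odata[blocks];
    cudaMemcpy(h_odata, d_odata, blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    unsigned long long total = 0;
    for (unsigned int b = 0; b < blocks; ++b)
        total += h_odata[b];

    printf("total count = %llu (expected %u)\n", total, n);

    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}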
Thanks,
CRF