Hello,
I was looking at the reduction implementation in CUDA SDK for the final optimized kerenel, and i just dont understand why there is an i=+gridSize sum in the while loop.
i.e: if i have a n=512 elements long 1D array, which was diveded to 8 blocks of 64 threads, then gridSize=2328=512 and i+=512 will never be smaller than n(n=512).
can you please advise?
this is the code:
template <class T, unsigned int blockSize, bool nIsPow2>
global void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
T *sdata = SharedMemory();
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
T mySum = 0;
// we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridDim). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
mySum += g_idata[i];
// ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
if (nIsPow2 || i + blockSize < n)
mySum += g_idata[i+blockSize];
i += gridSize; }