reduction implementation in CUDA SDK

Hello,

I was looking at the reduction implementation in CUDA SDK for the final optimized kerenel, and i just dont understand why there is an i=+gridSize sum in the while loop.
i.e: if i have a n=512 elements long 1D array, which was diveded to 8 blocks of 64 threads, then gridSize=2328=512 and i+=512 will never be smaller than n(n=512).

can you please advise?

this is the code:

template <class T, unsigned int blockSize, bool nIsPow2>
global void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
T *sdata = SharedMemory();

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;

T mySum = 0;

// we reduce multiple elements per thread.  The number is determined by the 
// number of active thread blocks (via gridDim).  More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{         
    mySum += g_idata[i];
    // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
    if (nIsPow2 || i + blockSize < n) 
        mySum += g_idata[i+blockSize];  

i += gridSize; }

check out its associated whitepaper - it says that the i += gridSize line is to have a pitched offset in order to maintain coalesced memory accesses in the loop.

edited for grammar + link