Algorithm Cascading

I am trying to learn about parallel reduction through the example in the SDK. The white paper for parallel reduction explains 7 steps to optimize the process. I can understand the first six optimizations, but I can't understand the part on algorithm cascading.

Following is the code provided in the white paper, slide 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*2*blockDim.x + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += x[i] + x[i+blockDim.x];
    i += gridSize;
}
__syncthreads();

Is the gridSize here half the length of the vector to be reduced? And what does n refer to (in while(i<n))?

Thanks.

n is the length of the vector, and gridSize is the stride each thread uses when reading the vector from global memory. It equals blockSize*2*gridDim.x, i.e. twice the total number of threads in the grid, because each pass of the while loop reads two elements per thread.

I have implemented the parallel reduction algorithm. The results were correct, but the performance didn't seem to be great (slightly slower than the serial version). I guess I have done something fundamentally wrong, but I am not sure where.

The test array size i use is 1048576. (1024*1024).

My block size is 256, and this is how I called the kernel:
blockno=length/BLOCK_SIZE;
reduction0<<<blockno/16,BLOCK_SIZE>>>(dx,dpsum,length);

I allocate the memory to the device by:
cudaMalloc((void**)&dx, length*sizeof(float));
cudaMalloc((void**)&dpsum, blockno*sizeof(float)/16);

The following is my kernel:
__global__ void reduction0(float *x, float *psum, int n)
{
__shared__ float sdata[BLOCK_SIZE];

unsigned int tid=threadIdx.x;
unsigned int i=blockIdx.x*2*blockDim.x+threadIdx.x;
unsigned int gridSize=BLOCK_SIZE*2*gridDim.x;
sdata[tid]=0;
while(i<n){
sdata[tid]+=x[i]+x[i+blockDim.x];
i+=gridSize;}
__syncthreads();

for(unsigned int stride=blockDim.x>>1;stride>32;stride>>=1)
{
    if(tid<stride)
    {
    sdata[tid]+=sdata[tid+stride];
    }
__syncthreads();
}

if(tid<32)
{
    // warp-synchronous part: a volatile pointer forces each partial
    // sum to be re-read from shared memory between steps
    volatile float *smem = sdata;
    smem[tid]+=smem[tid+32];
    smem[tid]+=smem[tid+16];
    smem[tid]+=smem[tid+8];
    smem[tid]+=smem[tid+4];
    smem[tid]+=smem[tid+2];
    smem[tid]+=smem[tid+1];
}

// write this block's partial sum to global memory
if(tid==0) psum[blockIdx.x]=sdata[0];
}

You are probably launching far too many blocks. Try something like 2 blocks per MP and see how it performs.

I have tried

reduction0<<<blockno/128,BLOCK_SIZE>>>(dx,dpsum,length); (BLOCK_SIZE=128)

and

reduction0<<<blockno/32,BLOCK_SIZE>>>(dx,dpsum,length); (BLOCK_SIZE=512)

But there is still no significant improvement; it is still slower than the serial version. My graphics card is a GTX 285.

Could the compiler be an issue? I am using nvmex in MATLAB to compile, and I did not specify the compute capability as 1.3.

unsigned int gridSize = blockSize*2*gridDim.x;

I think it's the total number of blocks (which contain the threads), not the total number of threads.
I also find it hard to understand the cascading.