I am trying to learn about parallel reduction through the example in SDK. The white paper for parallel reduction explains 7 steps to optimize the process. I can understand the first six optimization procedure, but i cant understand the part on algorithm cascading.
Following is the code provided in the white paper, slide 32:
unsigned int tid=threadIdx.x;
unsigned int i=blockIdx.x2blockDim.x+threadIdx.x;
unsigned int gridSize=blockSize2gridDim.x;
sdata[tid]=0;
while(i<n){
sdata[tid]=x[i]+x[i+blockDim.x];
i+=gridSize}
__syncthreads();
Is the gridSize here half of the length of the vector to be reduced? And what does n refers to? (in while(i<n))
n will be the length of the vector and gridSize is the total number of threads in the grid, which is used as the stride in the read of the vector from global memory.
I have implemented the parallel reduction algorithm. The result were correct, but the performance didn’t seem to be that great. (Slightly slower than serial version) I guess i have done something fundamentally wrong, but I am not sure where.
The test array size i use is 1048576. (1024*1024).
My block size is 256, and this is how i called the kernel:
blockno=length/BLOCK_SIZE;
reduction0<<<blockno/16,BLOCK_SIZE>>>(dx,dpsum,length);
I allocate the memory to the device by:
cudaMalloc((void**)&dx,lengthsizeof(float));
cudaMalloc((void**)&dpsum,blocknosizeof(float)/16);
The following is my kernel: global void reduction0(floatx,floatpsum,int n)
{ shared float sdata[BLOCK_SIZE];
unsigned int tid=threadIdx.x;
unsigned int i=blockIdx.x*2*blockDim.x+threadIdx.x;
unsigned int gridSize=BLOCK_SIZE*2*gridDim.x;
sdata[tid]=0;
while(i<n){
sdata[tid]+=x[i]+x[i+blockDim.x];
i+=gridSize;}
__syncthreads();
for(unsigned int stride=blockDim.x>>1;stride>32;stride>>=1)
{
if(tid<stride)
{
sdata[tid]+=sdata[tid+stride];
}
__syncthreads();
}
if(tid<32)
{
sdata[tid]+=sdata[tid+32];
sdata[tid]+=sdata[tid+16];
sdata[tid]+=sdata[tid+8];
sdata[tid]+=sdata[tid+4];
sdata[tid]+=sdata[tid+2];
sdata[tid]+=sdata[tid+1];
}
I think its the number of total blocks which they contain the threads and not not total number of threads.
I also find it hard to understand the cascading.