Algorithm Cascading

I am trying to learn about parallel reduction through the example in the SDK. The white paper for parallel reduction explains 7 steps to optimize the process. I can understand the first six optimizations, but I can't understand the part on algorithm cascading.

Following is the code provided in the white paper, slide 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*2*blockDim.x + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += x[i] + x[i+blockDim.x];
    i += gridSize;
}
__syncthreads();

Is the gridSize here half the length of the vector to be reduced? And what does n refer to (in while(i<n))?

Thanks.

n is the length of the vector, and gridSize is the stride each thread uses when reading the vector from global memory. It equals blockSize*2*gridDim.x, i.e. twice the total number of threads in the grid, because each pass of the while loop reads two elements per thread.

I have implemented the parallel reduction algorithm. The results were correct, but the performance didn't seem to be great (slightly slower than the serial version). I guess I have done something fundamentally wrong, but I am not sure where.

The test array size i use is 1048576. (1024*1024).

My block size is 256, and this is how I called the kernel:
blockno=length/BLOCK_SIZE;
reduction0<<<blockno/16,BLOCK_SIZE>>>(dx,dpsum,length);

I allocate the memory to the device by:
cudaMalloc((void**)&dx, length*sizeof(float));
cudaMalloc((void**)&dpsum, blockno*sizeof(float)/16);

The following is my kernel:
__global__ void reduction0(float *x, float *psum, int n)
{
__shared__ float sdata[BLOCK_SIZE];

unsigned int tid=threadIdx.x;
unsigned int i=blockIdx.x*2*blockDim.x+threadIdx.x;
unsigned int gridSize=BLOCK_SIZE*2*gridDim.x;
sdata[tid]=0;
while(i<n){
sdata[tid]+=x[i]+x[i+blockDim.x];
i+=gridSize;}
__syncthreads();

for(unsigned int stride=blockDim.x>>1;stride>32;stride>>=1)
{
    if(tid<stride)
    {
    sdata[tid]+=sdata[tid+stride];
    }
__syncthreads();
}

if(tid<32)
{
    // warp-synchronous part: a volatile pointer forces each partial
    // sum to be re-read from shared memory between steps
    volatile float *smem = sdata;
    smem[tid]+=smem[tid+32];
    smem[tid]+=smem[tid+16];
    smem[tid]+=smem[tid+8];
    smem[tid]+=smem[tid+4];
    smem[tid]+=smem[tid+2];
    smem[tid]+=smem[tid+1];
}

// write this block's partial sum to global memory
if(tid==0) psum[blockIdx.x]=sdata[0];
}

You are probably launching far too many blocks. Try something like 2 blocks per MP and see how it performs.

I have tried

reduction0<<<blockno/128,BLOCK_SIZE>>>(dx,dpsum,length); (BLOCK_SIZE=128)

and

reduction0<<<blockno/32,BLOCK_SIZE>>>(dx,dpsum,length); (BLOCK_SIZE=512)

But there is still no significant improvement; it is still slower than the serial version. My graphics card is a GTX 285.

Could the compiler be an issue? I am using nvmex in MATLAB to compile, and I did not specify the compute capability as 1.3.

unsigned int gridSize = blockSize*2*gridDim.x;

I think it's the total number of blocks (which contain the threads), not the total number of threads.
I also find it hard to understand the cascading.