Reduction Problem

kelson · October 6, 2010, 12:18pm

Hi,

I’m trying to reduce a table of any size. This is how I programmed it:

__device__ float FullReduction(float* _iarray, int row, int col)
{
	//shared memory array
	__shared__ float sdata[BLOCK_SIZE];
	int tid = threadIdx.x;
	//int i = blockIdx.x *(blockDim.x) + threadIdx.x;
	int size = col * row;

	//each thread clears its own slot in shared memory
	sdata[tid] = 0;
	__syncthreads();

	//number of BLOCK_SIZE-sized chunks needed to cover the array
	int nblocks = size / BLOCK_SIZE + ((size % BLOCK_SIZE) == 0 ? 0 : 1);

	//case where the whole array fits in a single chunk (size <= BLOCK_SIZE)
	if (nblocks < 2) {
		if (tid < size)
			sdata[tid] = _iarray[tid];
		__syncthreads();
	}
	//case where the array spans several chunks: each thread accumulates one element per chunk
	else {
		for (int k = 0; k < nblocks; k++) {
			if ((tid + k * BLOCK_SIZE) < size)
				sdata[tid] += _iarray[tid + k * BLOCK_SIZE];
			__syncthreads();
		}
	}

	//tree reduction over sdata
	for (int j = BLOCK_SIZE / 2; j > 0; j >>= 1)
	{
		if (tid < j) {
			sdata[tid] += sdata[tid + j];
		}
		__syncthreads();
	}

	//every thread returns the block-wide sum
	//if(tid == 0 )
	return sdata[0];
}

The result is OK when my array spans multiple blocks.
But when it fits in only one block, the program doesn’t return what I expect.

I would be thankful for any suggestions.

Thank you
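
For comparison with the function above, here is a minimal sketch of a single-block sum that treats every array size the same way, so there is no separate "one block" branch. It is not presented as the fix: the thread never establishes what causes the wrong single-block result. The names blockSum, sumKernel and sumOnDevice, the value BLOCK_SIZE = 256, and the one-block launch are all assumptions made for illustration. The extra __syncthreads() before the return only matters if the function is called more than once per kernel, which the original post does not say.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value; must be a power of two and match the launch block size

// Sums `size` floats with one thread block (hypothetical helper, not from the thread).
__device__ float blockSum(const float* in, int size)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;

    // Each thread accumulates a strided partial sum in a register.
    // This covers size < BLOCK_SIZE and size > BLOCK_SIZE with the same loop.
    float partial = 0.0f;
    for (int i = tid; i < size; i += BLOCK_SIZE)
        partial += in[i];
    sdata[tid] = partial;
    __syncthreads();

    // Tree reduction over shared memory, as in the original post.
    for (int j = BLOCK_SIZE / 2; j > 0; j >>= 1) {
        if (tid < j)
            sdata[tid] += sdata[tid + j];
        __syncthreads();
    }

    float total = sdata[0];
    // Make sure every thread has read sdata[0] before a later call overwrites it.
    __syncthreads();
    return total;
}

__global__ void sumKernel(const float* in, int size, float* out)
{
    float total = blockSum(in, size);
    if (threadIdx.x == 0)
        *out = total;
}

// Host wrapper (hypothetical). CUDA error checking is omitted for brevity.
float sumOnDevice(const std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0)
        return 0.0f;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumKernel<<<1, BLOCK_SIZE>>>(dIn, n, dOut);

    float total = 0.0f;
    cudaMemcpy(&total, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return total;
}
```

Calling sumOnDevice(std::vector<float>(100, 1.0f)) exercises the case where the array is smaller than one block, and a 1000-element vector exercises the multi-chunk case.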

Tomasz_Rybak · October 12, 2010, 5:41pm

Quick comment: if you have a Fermi card, it has more aggressive optimisation for threads inside a warp.
You need to use volatile for shared memory pointers.

For details, read the Fermi Compatibility Guide, Chapter 1.2.2.
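
To illustrate the kind of code that advice is usually aimed at, here is a rough sketch of a reduction whose last steps run inside a single warp without __syncthreads() and read shared memory through a volatile pointer. The function in the first post calls __syncthreads() after every step, so it does not use this pattern. The names warpReduce, sumKernelWarp and sumOnDeviceWarp and the value BLOCK_SIZE = 256 are assumptions. The __syncwarp() calls are an addition that is not mentioned anywhere in the thread: they require CUDA 9 or later and are there so the sketch does not depend on lockstep warp execution.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value; power of two, at least 64, equal to the launch block size

// Final steps of the reduction, executed by the first warp only (tid < 32).
// volatile forces every access to go to shared memory instead of a cached register.
__device__ void warpReduce(volatile float* sdata, int tid)
{
    for (int offset = 32; offset > 0; offset >>= 1) {
        float v = sdata[tid] + sdata[tid + offset];  // read both operands
        __syncwarp();                                // all reads done before any write
        sdata[tid] = v;
        __syncwarp();                                // all writes done before the next reads
    }
}

__global__ void sumKernelWarp(const float* in, int size, float* out)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;

    // Strided partial sum per thread.
    float partial = 0.0f;
    for (int i = tid; i < size; i += BLOCK_SIZE)
        partial += in[i];
    sdata[tid] = partial;
    __syncthreads();

    // Block-wide tree reduction down to 64 partial sums.
    for (int j = BLOCK_SIZE / 2; j > 32; j >>= 1) {
        if (tid < j)
            sdata[tid] += sdata[tid + j];
        __syncthreads();
    }

    // The last 64 values are combined by the first warp alone.
    if (tid < 32)
        warpReduce(sdata, tid);

    if (tid == 0)
        *out = sdata[0];
}

// Host wrapper (hypothetical). CUDA error checking is omitted for brevity.
float sumOnDeviceWarp(const std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0)
        return 0.0f;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumKernelWarp<<<1, BLOCK_SIZE>>>(dIn, n, dOut);

    float total = 0.0f;
    cudaMemcpy(&total, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return total;
}
```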

kelson · October 13, 2010, 4:50am

Thank you for your comments.

I’m not using Fermi.

I still don’t understand this problem.
