Scan sample Program

kelson · October 1, 2010, 3:01pm

Hi,

I have a project in which I have to write a kernel to scan an array of any size:
1- the size may not be a power of 2
2- the array can be spread across several thread blocks
3- the size could be smaller than the number of threads in a single block
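
To make the three cases concrete: they can usually be covered by rounding the block count up and guarding every global load with a bounds check. This is only a minimal sketch; `gridSizeFor` and `guardedCopy` are hypothetical names chosen for illustration, not part of CUDA.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): number of blocks needed so that
// blocks * threadsPerBlock >= n, i.e. n / threadsPerBlock rounded up.
inline int gridSizeFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Each thread handles one element; threads past the end of the array do nothing.
// The guard is what makes non-power-of-2 sizes and short arrays safe.
__global__ void guardedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

With 256 threads per block, `gridSizeFor(1000, 256)` gives 4 blocks (the last one partly idle) and `gridSizeFor(100, 256)` gives a single block in which only the first 100 threads pass the guard.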

I wrote a program which gives me the results I wanted, but I'm not sure
this is the most parallel-efficient way.

__global__ void reduceKernel(float* _iarray, float* _oarray)
{
//shared memory array
__shared__ float sdata[BLOCK_SIZE];
int tid = threadIdx.x;
int i = blockIdx.x *(blockDim.x *2 ) + threadIdx.x;

//each thread zeroes its shared memory slot before accumulating
sdata[tid] = 0;
//number of BLOCK_SIZE-sized chunks needed to cover the array (rounded up)
int nblocks = (int)ARRAYSIZE / BLOCK_SIZE + ( (ARRAYSIZE % BLOCK_SIZE) == 0 ? 0:1);

//case where the array has fewer elements than the threads in a single block
if(ARRAYSIZE < BLOCK_SIZE){
	//shared memory is indexed by tid; the global index i is only used for the load
	if(i < ARRAYSIZE)
		sdata[tid] = _iarray[i];	
}
//case where the array elements are spread across multiple thread blocks
else{
	for (int k = 0; k < nblocks; k++){
		if((i + k*blockDim.x) < ARRAYSIZE)	
			sdata[tid] += _iarray[i + k * blockDim.x];
	}
}	
//synchronization 
__syncthreads();
//loop over sdata
for(int j = blockDim.x/2; j > 0; j>>=1)
{		
	if(tid < j){
		sdata[tid] += sdata[tid + j];
	}
__syncthreads();
}
//write back the data to the output array 
if(tid == 0 )
	_oarray[0] = sdata[0];

}
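
For comparison, here is a sketch of a common alternative layout, not necessarily the best one: each block accumulates with a grid-stride loop, reduces in shared memory, and writes its own partial sum to `partial[blockIdx.x]` instead of every block writing to element 0. A second single-block launch then sums the partials. The names `blockSumKernel`, `sumOnDevice` and `kBlockSize` are hypothetical, the block size is assumed to be a power of two, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed; must be a power of two for the tree loop

// Each block writes one partial sum to partial[blockIdx.x].
__global__ void blockSumKernel(const float* in, float* partial, int n)
{
    __shared__ float sdata[kBlockSize];
    int tid = threadIdx.x;

    // Grid-stride loop: works whether n is smaller or larger than the
    // total number of threads, and for sizes that are not a power of 2.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory (same pattern as the kernel above).
    for (int j = blockDim.x / 2; j > 0; j >>= 1) {
        if (tid < j)
            sdata[tid] += sdata[tid + j];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums `host` on the GPU. No error checking.
float sumOnDevice(const std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest

    float *dIn = nullptr, *dPartial = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block. Pass 2: one block sums the partials.
    blockSumKernel<<<blocks, kBlockSize>>>(dIn, dPartial, n);
    blockSumKernel<<<1, kBlockSize>>>(dPartial, dTotal, blocks);

    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return total;
}
```

Calling `sumOnDevice` on a vector of 1000 ones exercises the non-power-of-2, multi-block case, and a vector of 100 ones exercises the case where the array is smaller than one block.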

Any comments are welcome.
Thank you.
