Tiled Reduction Problem Only Works if Tile count is 1

djmj1000 · August 20, 2011, 1:32am

I have a 2D square float matrix, and each block calculates the maximum value of its corresponding row using a reduction.

Assumptions for this minimized kernel:

Dim is the number of rows and columns of the matrix, which is also the number of blocks, and it is a power of 2.

Since a block can have at most 512 threads on compute capability 1.1, a single reduction can only compare 512 elements, so I want to split each row into tiles.

The minimum value in the matrix is 0 (this is what the shared maximum is initialized to).

The tile width must be a power of 2 for the reduction to work.

This code only calculates correct results if the tile count is 1, which means the tile width equals the dimension of the matrix.

#define TILE_WIDTH 512 //power of 2; the kernel assumes blockDim.x == TILE_WIDTH (max 512 threads per block on compute capability 1.1)

/*
 * @param mat float** Input square float matrix (array of row pointers)
 * @param widthTileCount unsigned int Number of tiles per row (loop iterations)
 * @param maxArr float* Output float array holding the maximum value of each row
 */
__global__ void cluster(float** mat,
	unsigned int widthTileCount,
	float* maxArr)
{
	//shared reduction array, one element per thread
	__shared__ float values[TILE_WIDTH];

	//shared maximum value of this block
	__shared__ float max;

	//initialize shared memory (valid because all matrix values are >= 0)
	if(threadIdx.x == 0)
		max = 0.0f;
	__syncthreads();

	//column index
	unsigned int colIdx = 0;

	/*
	 * loop over the tiles required to cover the row
	 * and update the shared maximum after each tile's reduction
	 */
	for(unsigned int i = 0; i < widthTileCount; i++)
	{
		//get linear column index of this thread
		colIdx = i * TILE_WIDTH + threadIdx.x;

		//load values into shared memory
		values[threadIdx.x] = mat[blockIdx.x][colIdx];
		__syncthreads();

		/*
		 * tree reduction: halve the number of active threads each step
		 */
		for(unsigned int s = blockDim.x >> 1; s > 0; s >>= 1)
		{
			if(threadIdx.x < s)
				values[threadIdx.x] = fmaxf(values[threadIdx.x], values[threadIdx.x + s]);
			__syncthreads();
		}

		//compare the maximum of the current tile with the shared maximum
		if(threadIdx.x == 0)
			max = fmaxf(max, values[0]);
		//make sure values[0] has been read before the next tile overwrites it
		__syncthreads();
	}

	//output data
	if(threadIdx.x == 0)
		maxArr[blockIdx.x] = max;
}

MaxArr:

is: 

0.985   0.991   0.932   0.948   0.992   0.926   0.704   0.893   0.970   0.941 

should be: 

0.985   0.991   0.972   0.993   0.992   0.977   0.957   0.973   0.970   0.941
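One way to get reference values like the "should be" row, and to trace the kernel's steps without pen and paper, is a plain CPU version of the same loop. The following is only a sketch in host-side C++: `rowMaxTiled` and `rowMaxReference` are hypothetical names that do not appear in the kernel, `tileWidth` stands in for TILE_WIDTH (and blockDim.x), and the row length is assumed to be a multiple of the tile width.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for one block of the kernel: reduces one row tile by tile.
// tileWidth plays the role of TILE_WIDTH and must be a power of 2 that divides row.size().
float rowMaxTiled(const std::vector<float>& row, std::size_t tileWidth)
{
    float best = 0.0f;  // same starting value as the shared max (values assumed >= 0)
    const std::size_t tileCount = row.size() / tileWidth;  // tiles needed to cover the row
    std::vector<float> values(tileWidth);  // stands in for the shared array

    for (std::size_t i = 0; i < tileCount; ++i) {
        // load one tile, the way each thread loads one element
        for (std::size_t t = 0; t < tileWidth; ++t)
            values[t] = row[i * tileWidth + t];

        // tree reduction: halve the active range each step
        for (std::size_t s = tileWidth >> 1; s > 0; s >>= 1)
            for (std::size_t t = 0; t < s; ++t)
                values[t] = std::max(values[t], values[t + s]);

        // fold this tile's maximum into the running maximum
        best = std::max(best, values[0]);
    }
    return best;
}

// Plain linear scan used as the reference answer.
float rowMaxReference(const std::vector<float>& row)
{
    float best = 0.0f;
    for (float v : row)
        best = std::max(best, v);
    return best;
}
```

If `rowMaxTiled` and `rowMaxReference` agree for every tile width but the GPU output still differs, the reduction logic itself is probably fine and the problem is more likely in the launch parameters.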

I don't get it. I followed the steps on paper, and some of the output values are right and some are wrong.

I would be thankful for any hints.

Edit: Yeah, if widthTileCount is wrong, the results cannot be correct -.-
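The edit does not say which value was passed, but the wrong outputs are all smaller than the expected ones, which is consistent with part of each row never being scanned. For the kernel to cover a row exactly once, the tile count has to be Dim / TILE_WIDTH. A minimal host-side sketch of that check follows; `isPowerOfTwo` and `tileCountFor` are hypothetical helpers, not part of the posted code.

```cpp
#include <cstddef>

// Hypothetical helper: true if n is a power of two (n > 0).
bool isPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: number of tiles the kernel loop must visit per row.
// Returns 0 when the assumptions of the post are violated, so the caller
// can refuse to launch instead of getting silently wrong maxima.
std::size_t tileCountFor(std::size_t dim, std::size_t tileWidth)
{
    if (!isPowerOfTwo(dim) || !isPowerOfTwo(tileWidth))
        return 0;
    if (dim < tileWidth)
        return 0;  // tile wider than the row: the kernel would read past the row
    return dim / tileWidth;  // exact, since both are powers of two
}
```

With Dim = 2048 and TILE_WIDTH = 512 this gives 4. Assuming Dim blocks of TILE_WIDTH threads, the launch would then look something like `cluster<<<dim, TILE_WIDTH>>>(mat, tileCountFor(dim, TILE_WIDTH), maxArr);`.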
