Tiled Reduction Only Works if the Tile Count Is 1

I have a 2D square float matrix, and each block computes the maximum value of its corresponding row using a reduction.

Assumptions for this minimized kernel:

Dim is the number of rows and columns, and also the number of blocks. It is a power of 2.

Since a block can have at most 512 threads on compute capability 1.1 hardware, a single reduction can only compare 512 elements.

So I want to split each row into tiles.

The minimum value in the array is 0 (used to initialize shared memory).

The tile width must be a power of 2 for the reduction to work.
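To illustrate why the power-of-2 assumption matters, here is a minimal CPU sketch of the halving-stride reduction. The helper name `treeReduceMax` is hypothetical and only emulates the reduction loop sequentially; it is not code from the kernel itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: CPU emulation of a halving-stride
// max reduction. Each pass folds the upper half of the active range into
// the lower half, as the threads with index < s would do in parallel.
float treeReduceMax(std::vector<float> values)
{
    assert(!values.empty());
    for (std::size_t s = values.size() >> 1; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t)
            values[t] = std::fmax(values[t], values[t + s]);
    }
    return values[0];
}
```

With a width of 4 every element is folded into `values[0]`. With a width of 6 the stride goes from 3 to 1, so the element at index 2 is never compared again and a maximum stored there is lost.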

This code only produces correct results if the tile count is 1, i.e. the tile width equals the dimension of the matrix.

#define TILE_WIDTH 512 // power of 2; must equal blockDim.x (at most 512 threads per block on compute capability 1.1)

/*
 * @param mat float** Input square float matrix
 * @param widthTileCount int Number of tiles per row (iteration count)
 * @param maxArr float* Output float array holding the maximum value of each row
 */
__global__ void cluster(float** mat,
	unsigned int widthTileCount,
	float* maxArr)
{
	//shared reduction array
	__shared__ float values[TILE_WIDTH];

	//shared maximum value of this block
	__shared__ float max;

	//initialize shared memory
	if(threadIdx.x == 0)
		max = 0.0f;
	__syncthreads();

	//column index
	unsigned int colIdx = 0;

	/*
	 * loop over the tiles required to cover the row,
	 * then compare and set the shared maximum after each reduction
	 */
	for(unsigned int i = 0; i < widthTileCount; i++)
	{
		//get linear column index of this thread
		colIdx = i * TILE_WIDTH + threadIdx.x;

		//load values into shared memory
		values[threadIdx.x] = mat[blockIdx.x][colIdx];
		__syncthreads();

		/*
		 * do the reduction: halve the stride on every pass
		 */
		for(unsigned int s = blockDim.x >> 1; s > 0; s >>= 1)
		{
			if(threadIdx.x < s)
				values[threadIdx.x] = fmaxf(values[threadIdx.x], values[threadIdx.x + s]);
			__syncthreads();
		}

		//compare and set shared max value with the one of the current tile
		if(threadIdx.x == 0)
			max = fmaxf(max, values[0]);
		__syncthreads();
	}

	//output data
	if(threadIdx.x == 0)
		maxArr[blockIdx.x] = max;
}
maxArr is:

0.985   0.991   0.932   0.948   0.992   0.926   0.704   0.893   0.970   0.941 

maxArr should be:

0.985   0.991   0.972   0.993   0.992   0.977   0.957   0.973   0.970   0.941
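A comparison like the one above can be automated with a small CPU reference. The following is only a sketch: the names `rowMaxReference` and `firstMismatch` are hypothetical, and it assumes the matrix is also available on the host as one flat row-major array, which is not how the kernel receives it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: row maxima of a dim x dim matrix stored row-major
// in one flat vector (an assumed host-side layout).
std::vector<float> rowMaxReference(const std::vector<float>& mat, std::size_t dim)
{
    assert(mat.size() == dim * dim);
    std::vector<float> result(dim);
    for (std::size_t r = 0; r < dim; ++r) {
        const float* first = mat.data() + r * dim;
        result[r] = *std::max_element(first, first + dim);
    }
    return result;
}

// Hypothetical helper: index of the first row whose GPU result differs
// from the reference by more than tol, or -1 if all rows agree.
long firstMismatch(const std::vector<float>& gpu,
                   const std::vector<float>& ref,
                   float tol = 0.0f)
{
    assert(gpu.size() == ref.size());
    for (std::size_t r = 0; r < ref.size(); ++r)
        if (std::fabs(gpu[r] - ref[r]) > tol)
            return static_cast<long>(r);
    return -1;
}
```

Because a maximum only selects one of the input values, the GPU and CPU results should match exactly, so a tolerance of 0 is a reasonable default here.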

I don't get it; I traced some of the steps on paper. Some values in the output are right and some are wrong.

I would be thankful for any hints.

Edit: Solved. The widthTileCount value I passed to the kernel was wrong, so the results could not be correct.
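For anyone hitting the same problem, here is a minimal host-side sketch of how the tile count could be derived and what a wrong count does. Both helpers (`computeWidthTileCount`, `tiledRowMax`) are hypothetical names, and the emulation assumes the count was too small, which the post does not state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of tiles needed to cover one row.
// Rejects sizes the kernel cannot handle (row not a whole number of tiles).
unsigned int computeWidthTileCount(unsigned int dim, unsigned int tileWidth)
{
    if (dim == 0 || tileWidth == 0 || dim % tileWidth != 0)
        throw std::invalid_argument("dim must be a non-zero multiple of tileWidth");
    return dim / tileWidth;
}

// Hypothetical CPU emulation of the kernel's outer loop:
// only the first tileCount tiles of the row are ever read.
float tiledRowMax(const std::vector<float>& row,
                  unsigned int tileWidth,
                  unsigned int tileCount)
{
    assert(static_cast<std::size_t>(tileWidth) * tileCount <= row.size());
    float best = 0.0f; // same initial value as the shared max in the kernel
    for (unsigned int i = 0; i < tileCount; ++i)
        for (unsigned int t = 0; t < tileWidth; ++t)
            best = std::fmax(best, row[static_cast<std::size_t>(i) * tileWidth + t]);
    return best;
}
```

If the count is too small, rows whose maximum lies in a visited tile still come out right while the others do not, which would be consistent with a mix of correct and incorrect values. For a dim of 1024 and a tile width of 512, `computeWidthTileCount` returns 2.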