I have a 2D square float matrix, and each block computes the maximum value of its corresponding row using a reduction.

Assumptions for this minimized kernel to work:

Dim is the number of rows and columns, which is also the number of blocks, and it is a power of 2.

Since we can only have 512 threads per block on CUDA 1.1, I can only compare 512 elements in one reduction, so I want to split each row into tiles.

The minimum value in the matrix is 0, which is why the shared maximum is initialized to 0.

The tile width must be a power of 2 for the reduction to work.
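
To illustrate the power-of-2 requirement, here is a minimal host-side sketch that imitates a halving-stride max reduction sequentially. `treeReduceMax` is a hypothetical helper, not part of the kernel; it only assumes the same loop shape as the kernel's reduction (the stride starts at half the length and is halved each step).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper: sequential imitation of a halving-stride
// max reduction over one tile. The tile is taken by value, so the caller's
// data is left untouched.
float treeReduceMax(std::vector<float> values)
{
    assert(!values.empty());
    for (std::size_t s = values.size() >> 1; s > 0; s >>= 1)
    {
        // On the GPU, each index i < s would be handled by its own thread.
        for (std::size_t i = 0; i < s; ++i)
            values[i] = std::max(values[i], values[i + s]);
    }
    return values[0];
}
```

With 8 elements every value is eventually folded into index 0. With the 6 elements `{1, 2, 9, 4, 5, 6}` the strides are 3 and then 1, so index 2 (holding 9) is never compared against index 0 and the result is 5 instead of 9.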

This code only calculates correct results if the tile count is 1, which means the tile width is equal to the dimension of the matrix.

```
#define TILE_WIDTH 512 //power of 2; must equal blockDim.x (at most 512 threads per block on CUDA 1.1)
/*
 * @param mat float** Input square float matrix (one pointer per row)
 * @param widthTileCount unsigned int Number of tiles per row (Dim / TILE_WIDTH)
 * @param maxArr float* Output float array holding the maximum value of each row
 */
__global__ void cluster(float** mat,
                        unsigned int widthTileCount,
                        float* maxArr)
{
    //shared reduction array
    __shared__ float values[TILE_WIDTH];
    //shared maximum value of this block
    __shared__ float max;
    //initialize shared memory
    if(threadIdx.x == 0)
        max = 0.0f;
    __syncthreads();
    //column index
    unsigned int colIdx = 0;
    /*
     * loop over the tiles required to cover the row;
     * after each tile's reduction, update the shared maximum
     */
    for(unsigned int i = 0; i < widthTileCount; i++)
    {
        //get linear column index of this thread
        colIdx = i * TILE_WIDTH + threadIdx.x;
        //load values into shared memory
        values[threadIdx.x] = mat[blockIdx.x][colIdx];
        __syncthreads();
        /*
         * Do reduction: halve the stride each step until values[0] holds the tile maximum
         */
        for(unsigned int s = blockDim.x >> 1; s > 0; s >>= 1)
        {
            if(threadIdx.x < s)
                values[threadIdx.x] = fmaxf(values[threadIdx.x], values[threadIdx.x + s]);
            __syncthreads();
        }
        //compare the shared max value with the maximum of the current tile
        if(threadIdx.x == 0)
            max = fmaxf(max, values[0]);
        //make sure values[0] has been read before the next tile overwrites it
        __syncthreads();
    }
    //output data
    if(threadIdx.x == 0)
        maxArr[blockIdx.x] = max;
}
```

```
MaxArr:
is:
0.985 0.991 0.932 0.948 0.992 0.926 0.704 0.893 0.970 0.941
should be:
0.985 0.991 0.972 0.993 0.992 0.977 0.957 0.973 0.970 0.941
```
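
One way to produce the "should be" values is a plain sequential reference on the host. This is only a minimal sketch: it assumes the matrix is also available as a flat row-major array (the kernel itself takes an array of row pointers), and `rowMaxReference` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: sequential maximum of every row of a
// dim x dim matrix stored row-major in a flat vector.
std::vector<float> rowMaxReference(const std::vector<float>& mat, unsigned int dim)
{
    assert(mat.size() == static_cast<std::size_t>(dim) * dim);
    // Start from 0, the same floor the kernel uses for its shared maximum.
    std::vector<float> maxArr(dim, 0.0f);
    for (std::size_t row = 0; row < dim; ++row)
        for (std::size_t col = 0; col < dim; ++col)
            maxArr[row] = std::max(maxArr[row], mat[row * dim + col]);
    return maxArr;
}
```

Comparing this array element by element with the `maxArr` copied back from the device shows which rows the kernel gets wrong.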

I don't get it; I followed the steps on paper. Some values in the output are right and some are wrong.

I would be thankful for any hints.

Edit: Yeah, if widthTileCount is wrong, the results cannot be correct -.-
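
For reference, a sketch of how the tile count could be derived on the host so that it always matches the matrix dimension. `isPowerOfTwo` and `tileCountFor` are hypothetical helpers, and returning 0 for unsupported sizes is just one possible convention.

```cpp
#include <cassert>

// Hypothetical host-side helpers, not from the original kernel.

// True if x is a non-zero power of two.
bool isPowerOfTwo(unsigned int x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

// Number of tiles the kernel has to loop over for one row.
// Returns 0 when the kernel's assumptions are violated
// (dim or tileWidth not a power of two, or tileWidth larger than dim).
unsigned int tileCountFor(unsigned int dim, unsigned int tileWidth)
{
    if (!isPowerOfTwo(dim) || !isPowerOfTwo(tileWidth) || tileWidth > dim)
        return 0;
    return dim / tileWidth;  // exact, since both are powers of two
}
```

The kernel would then presumably be launched with `dim` blocks, `TILE_WIDTH` threads per block, and `tileCountFor(dim, TILE_WIDTH)` passed as `widthTileCount`.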