Hi All, I need some help here desperately.

I wrote the following kernel:

```
extern "C"
__global__ void sumRok(double* matrix, double *output, unsigned int rows, unsigned int cols){
__shared__ double tile[WARP_SIZE][2*WARP_SIZE+1];
int rIndex = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int cIndex = blockIdx.y * WARP_SIZE + threadIdx.y;
tile[threadIdx.y][threadIdx.x] = 0;
tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;
bool alto = true;
if (cIndex < cols)
for (int i = 0; i < BLOCK_SIZE; i+=WARP_SIZE){
if (rIndex+i < rows) {
if (alto)
tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];
else tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] += matrix[rIndex+i+rows*cIndex];
alto = !alto;
} else break;
}
__syncthreads();
unsigned int bound = WARP_SIZE +1;
while (bound > 0){
tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];
bound >>= 1;
}
if (threadIdx.x == 0) output[cIndex * REQ(rows, BLOCK_SIZE) + blockIdx.x] = tile[threadIdx.y][0];
}
```

The purpose of this kernel is to sum the rows of a matrix. There are a few things that I ought to explain:

in order to speed up, this kernel reads and adds (BLOCK_SIZE/WARP_SIZE) rows into shared memory of 2*WARP_SIZE tiles;

The shared memory is actually 2*WARP_SIZE+1 in width, with the added 1 acting as padding to avoid bank conflict;

The threads then reduce 2*WARP_SIZE doubles to a single value. Here no syncthreads is used because (I think) the threads are always synchronized in the same warp (each warp handles a row in the shared memory).

Since a lot of replies said that it is hard to understand the kernel, so here is some explanation:

suppose my warp size is 32, and I set the block size to 512. To run this kernel, I always use 32*32 blocks so that it matches the warp size.

first, a 32*65 shared memory is allocated

then each thread reads 16 entries of the matrix (32 rows apart – this read spans over 512 rows with 32 rows of threads) and add them to the shared memory. The 0th thread in the 0th row adds entries alternatively in the 0th space in the 0th row of shared memory and the 0+32+1 = 33th space. Notice the matrix is in column major format, hence to add values in the same column of the original, I need to add entries in the same row in the shared mem.

After all the read is done, I have a shared memory tile of 32*65, with the middle column acts as padding to prevent bank conflict. Each used entry in the shared memory is already a sum of 8 entries in the given matrix.

Now each warp performs reduction (i suppose no sync needed) to add 64 values into 1, stored in the first col of the shared mem. The reason 64 is chosen is to use every of the 32 threads in the first iteration of reduction (hence the alto adding scheme).

The first col of shared mem is now output.

I then tried to run the kernel with the following:

```
unsigned int row = 100;
unsigned int col = 20;
double* matrixHost;
matrixHost = (double*) malloc(row*col*sizeof(double));
double* matrixDevice;
cudaMalloc(&matrixDevice, row*col*sizeof(double));
for (int i = 0; i < row*col; i++)
matrixHost[i] = .01*i;
cudaMemcpy(matrixDevice, matrixHost, row*col*sizeof(double), cudaMemcpyHostToDevice);
double* resultHost;
resultHost = (double*)malloc(col*sizeof(double));
double* resultDevice;
cudaMalloc(&resultDevice, col*sizeof(double));
sumRok<<<REQ(col, WARP_SIZE), REQ(row, BLOCK_SIZE)>>>(matrixDevice, resultDevice, row, col);
cudaMemcpy(resultHost, resultDevice, col*sizeof(double), cudaMemcpyDeviceToHost);
printArray(resultHost, col);
int hold; cin >> hold;
```

It gives me junk. The compiler is used with arch=SM_20. win7 x64, sdk 3.2

What am I doing wrong here?

UPDATE: I took the advice and started using printf. it seems that the reading stage is fine, but then I added the following:

```
...
unsigned int bound = WARP_SIZE +1;
while (bound > 0){
tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];
printf("bound=%d tile(%d,%d)=%f\n",bound, threadIdx.y, threadIdx.x,tile[threadIdx.y][threadIdx.x]);
bound >>= 1;
}
...
```

It gave me this:

bound=33 tile(0,0)=1.920000

bound=16 tile(-1,0)=131074.014116

bound=8 tile(-1,0)=-17592198672381.984000

bound=4 tile(-1,0)=-18691711096573.984000

bound=2 tile(-1,0)=-36283909900029.984000

bound=1 tile(-1,0)=149344859335460940000000.000000

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27

744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+

066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

Notice that tile(-1, 0). What the hell is going on?