I have been working on a problem which required summation of rows followed by summation of the columns of a 2D array (matrix). I noticed the column summation was faster than the row summation, which to me goes against what I learned about memory access coalescence.

I extracted the summations to test this in isolation and found that column summations were about twice as fast as the row summations. I would have expected this to be the other way around since the row summations read contiguous data. I am obviously missing something, or being completely blind, and was wondering if anyone can see the reason for this. Below are the kernels I tested.

The first sums the value 1.0 across the rows:

```
static __global__ void dRowSum(float* d_row_sum) {
// get indecies
unsigned int s_tx = threadIdx.x;
unsigned int s_ty = threadIdx.y;
unsigned int d_ty = BLOCKSIZE * blockIdx.y + threadIdx.y;
// Create shared mem CP
__shared__ float s_C[BLOCKSIZE][BLOCKSIZE];
// @@ ... ms
s_C[s_ty][s_tx] = 1.0;
__syncthreads();
// Sum reduce each column of C
// @@ ... ms
for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
__syncthreads();
if (s_tx < stride) {
s_C[s_ty][s_tx] += s_C[s_ty][s_tx + stride];
}
}
__syncthreads();
// Save col 0 to global mem
// @@ ... ms
if (s_tx == 0) {
d_row_sum[d_ty] = s_C[s_ty][0];
}
}
```

The second sums the value 1.0 across the columns:

```
static __global__ void dColSum(float* d_col_sum) {
// get indecies
unsigned int s_tx = threadIdx.x;
unsigned int d_tx = BLOCKSIZE * blockIdx.x + threadIdx.x;
unsigned int s_ty = threadIdx.y;
// Create shared mem NFC
__shared__ float s_C[BLOCKSIZE][BLOCKSIZE];
// @@ ... ms
s_C[s_ty][s_tx] = 1.0;
__syncthreads();
// Sum reduce each column of C
// @@ ... ms
for (unsigned int stride = blockDim.y / 2; stride >= 1; stride >>= 1) {
__syncthreads();
if (s_ty < stride) {
s_C[s_ty][s_tx] += s_C[s_ty + stride][s_tx];
}
}
__syncthreads();
// Save row 0 to global mem
// @@ ... ms
if (s_ty == 0) {
d_col_sum[d_tx] = s_C[0][s_tx];
}
}
```

For a 500x500 block test on a Quadro FX1800M (compute capability 1.2), I got these results:

Grid setup 500 x 500 blocks :: 16 * 16 threads

Running: !!

Row Sum Time : 1018.276978

Col Sum Time : 390.657013

But whyyyyy???

Any pointers in the right direction would be greatly appreciated.