Hi, I’m doing some experimentations to grasp how memory coherent accesses works and I’m stuck with the following problem which I can’t explain. Basically I’m doing some summations of row elements of a matrix just to define a simple memory access pattern. If each row is aligned to a 256 byte boundary (as returned by cudaMallocPitched), the time spent in the kernel is roughly 4x higher than if the matrix’s rows weren’t nicely aligned (as returned by cudaMalloc). Why is this so? It only happens with certain matrix widths.

First, the simple kernel:

```
typedef struct {
unsigned int width;
unsigned int height;
unsigned int pitch;
float* elements;
} Matrix;
__global__ void test_kernel(Matrix M,float *R)
{
int tx = threadIdx.x, ty = threadIdx.y,
bx = blockIdx.x, by = blockIdx.y;
int row = by*16 + ty, col = bx*16 + tx;
if(row >= M.height || col >= M.width)
return;
float sum = 0;
for(int i=0; i<M.width; ++i)
{
// I'm interested in the memory access pattern,
// not in the uselfulness of the calculation
sum += M.elements[row*M.pitch+i];
}
// I'm not too worried about race conditions here
R[tx] = sum;
}
```

I’m calling the kernel like this:

```
float *R;
cudaMalloc((void **)&R, 16*sizeof(float));
int rows = 629, cols = 418;
M = alloc_matrix(rows, cols);
dim3 dimGrid((M.width+dimBlock.x-1)/dimBlock.x,
(M.height+dimBlock.y-1)/dimBlock.y);
dim3 dimBlock(16,16);
test_kernel<<<dimGrid, dimBlock>>>(M,R);
```

Each warp is accessing the same word in both cases (linear and pitch-linear), creating a 32-byte transaction for each half-warp with 1/8 efficiency. Shouldn’t the access’ times be equal?

Just for the record, my GPU is a Geforce 275 (sm_13), I’m running linux with cuda 3.2RC and kernel 260.24.

Thanks in advance and regards,

Rodolfo Lima