I’ve been running a simple test to help me understand how alignment affects global memory access:
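__global__ void align_kernel(float* idata, float* odata, int size, int dealign) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bsize = blockDim.x;
    int gsize = gridDim.x;
    // Each block copies one bsize-wide chunk per iteration, striding by the grid size.
    for (int i = bid; i * bsize < size; i += gsize) {
        // dealign shifts both the read and the write by the same number of floats.
        odata[dealign + i * bsize + tid] = idata[dealign + i * bsize + tid];
    }
}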
It is launched with 32 blocks of 384 threads each. size is a multiple of 384 (around 50,000,000). Here are some of the results for different values of dealign, which is an offset in float elements:
dealign    time (ms)
0          5.703105
8          51.920609
16         8.636962
24         51.927345
32         5.709507
I expected an offset of 16 to be properly aligned and perform as well as 0 or 32, and I don’t understand the penalty there. Any help?
-Jeff