How alignment affects global memory access

I’ve been running a simple test to help me understand how alignment affects global memory access:

__global__ void align_kernel(float* idata, float* odata, int size, int dealign) {
	int tid = threadIdx.x;
	int bid = blockIdx.x;
	int bsize = blockDim.x;
	int gsize = gridDim.x;

	// Each block copies chunks of bsize floats, striding by the grid size;
	// dealign shifts every access by that many floats.
	for(int i=bid; i*bsize<size; i+=gsize) {
		odata[dealign + i*bsize + tid] = idata[dealign + i*bsize + tid];
	}
}

The kernel is launched with 32 blocks of 384 threads each. size is a multiple of 384 (around 50,000,000). Here are some of the results for different values of dealign:

dealign    time (ms)






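For anyone who wants to reproduce the measurement, here is a minimal sketch of a host-side harness. It is not the exact code behind the timings in this post: the concrete value of size, the list of dealign values, and the CHECK macro are assumptions made for illustration. Both buffers are padded by the largest offset so the shifted accesses stay in bounds.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err), __LINE__);               \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Copy kernel from the post: every access is shifted by dealign floats.
__global__ void align_kernel(float* idata, float* odata, int size, int dealign) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bsize = blockDim.x;
    int gsize = gridDim.x;
    for (int i = bid; i * bsize < size; i += gsize) {
        odata[dealign + i * bsize + tid] = idata[dealign + i * bsize + tid];
    }
}

int main() {
    const int blocks = 32, threads = 384;    // launch shape from the post
    const int size = 384 * 130208;           // assumed: a multiple of 384 near 50,000,000
    const int dealigns[] = {0, 1, 16, 32};   // assumed sample offsets
    const int maxDealign = 32;               // largest offset in the list above
    const size_t bytes = (static_cast<size_t>(size) + maxDealign) * sizeof(float);

    float *idata = nullptr, *odata = nullptr;
    CHECK(cudaMalloc(&idata, bytes));
    CHECK(cudaMalloc(&odata, bytes));
    CHECK(cudaMemset(idata, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    for (int dealign : dealigns) {
        // Warm-up launch so one-time startup cost is not included in the timing.
        align_kernel<<<blocks, threads>>>(idata, odata, size, dealign);
        CHECK(cudaDeviceSynchronize());

        // Timed launch, bracketed by events on the default stream.
        CHECK(cudaEventRecord(start));
        align_kernel<<<blocks, threads>>>(idata, odata, size, dealign);
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaGetLastError());

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("dealign %2d: %.3f ms\n", dealign, ms);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(idata));
    CHECK(cudaFree(odata));
    return 0;
}
```

Timing with events around a single launch keeps host-side overhead out of the number; averaging several launches per offset would likely give steadier results.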
I expected an offset of 16 to be properly aligned and perform as well as 0 or 32, and I don’t understand the penalty there. Any help?
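To make the expectation concrete, the arithmetic can be sketched as follows. An offset of 16 floats is 64 bytes, so it sits on a 64-byte boundary but not on a 128-byte one, whereas 0 and 32 sit on both. The helper names and the 64- and 128-byte boundaries are assumptions for illustration; which boundary actually matters depends on the device.

```cuda
#include <cstddef>

// Byte offset that a shift of `dealign` floats adds to every access.
__host__ __device__ inline size_t byte_offset(int dealign) {
    return static_cast<size_t>(dealign) * sizeof(float);
}

// True if that byte offset is a multiple of `boundary` bytes.
// `boundary` is an assumed segment size, for example 64 or 128.
__host__ __device__ inline bool is_aligned(int dealign, size_t boundary) {
    return byte_offset(dealign) % boundary == 0;
}
```

Because each chunk of 384 floats spans 1536 bytes, which is itself a multiple of 128, the alignment of every chunk relative to the buffer start is determined by dealign alone.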