One of the rules for memory coalescing on devices of compute capability 1.0 and 1.1 is:
“The address of the first element is aligned to 16 times the element’s size.”
The problem is how to detect a misaligned starting address, since the address never appears explicitly in the code.
For example, in a function [credit to kaberdude]:
__device__ int forward[32] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
    11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
    21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void coalesce(int *data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sf[32];
    sf[idx] = forward[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sf[idx]] += n;
}
How can we know whether the address of the first element is aligned to a multiple of 64 bytes (16 * 4, for 4-byte ints)?
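The only check I can think of is to test the pointer value itself. Below is a minimal sketch of that idea, assuming the rule simply means "the byte address is a multiple of 16 * sizeof(element)". The helper name is_aligned_for_coalescing and the HOST_DEVICE macro are my own and are not part of the CUDA API.

```cpp
#include <cstddef>
#include <cstdint>

// Expand to CUDA qualifiers under nvcc, and to nothing in a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not a CUDA API): returns true when `addr` is a
// multiple of 16 * elem_size bytes, i.e. 64 bytes for 4-byte ints.
HOST_DEVICE inline bool is_aligned_for_coalescing(const void *addr,
                                                  std::size_t elem_size)
{
    std::uintptr_t value = reinterpret_cast<std::uintptr_t>(addr);
    return value % (16 * elem_size) == 0;
}
```

I suppose it could be called on the host with the data pointer before the launch, or inside the kernel with &data[sf[0]], but I am not sure that this is the address the rule refers to.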
Many thanks.
Jinpeng