Misaligned starting address for memory coalescing

One rule for memory coalescing (compute capability 1.0 and 1.1) is

“The address of the first element is aligned to 16 times the element’s size”

The problem is: how can we detect a misaligned starting address for memory coalescing, when the address does not appear explicitly in the code?

For example, in a function [credit to kaberdude]:

__device__ int forward[32] = { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                              11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void coalesce(int *data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;

    __shared__ int sf[32];
    sf[idx] = forward[idx];
    __syncthreads();

    for (int i = 0; i < iter; ++i)
        data[sf[idx]] += n;
}


How can we know whether “the address of the first element is aligned to a multiple of 64 = 16 × 4 bytes”?

Many thanks.


Under your assumption… I don’t know what you are doing with that array called sf, and that __syncthreads() seems unnecessary too.

Anyway, global memory pointers allocated with cudaMalloc() are guaranteed to be at least 128-byte aligned, so there is no problem in your code.

But I think you have not given the complete code. Why don’t you just do data[threadIdx.x] = n * iter;? If you do not always access data with the index threadIdx.x, then remember that within a thread the index has to increase in multiples of 32 (possibly inside the loop?) so that the offset address stays 128-byte aligned.

sf is used to store indices. With sf, data[sf[idx]] means thread idx accesses the sf[idx]-th element in global memory (because data is allocated in global memory). If I understand correctly, data[threadIdx.x + 1] will generate a misaligned memory access, right?



Yes, data[threadIdx.x + 1] will generate misaligned accesses. But what is the point of doing this?
If you must use the thread with threadIdx.x = x0 to access data[x0 + 1], then you could consider using shared memory to reorder your accesses. Otherwise just write with an increment of 32 (or a minimum of 16 for CC 1.x), like this: data[threadIdx.x + i * 32].

For using shared memory to reorder accesses, here is roughly how it looks:

__global__ void coalesce(int *data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;

    // I'll just use sf for my own purpose here… I still don't understand why
    // you use it to store indices. Why not use idx directly?
    // Also, its size should not be 32 but 31 + iter; you can set that in the
    // launch configuration.
    extern __shared__ int sf[];

    // I'm assuming you've already loaded the values from data into sf.
    for (int i = 0; i < iter; ++i)
        sf[idx + i] += n; // guaranteed to have no bank conflicts, and misalignment has no effect on shared memory

    int total = 31 + iter;       // total number of words to be written
    int max = (total + 31) / 32; // number of 128-byte memory transactions to global memory, rounded up

    for (int i = 0; i < max; ++i) {
        int accindex = i * 32 + idx; // so here, global memory is accessed with an increment of 32 words
        if (accindex < total)
            data[accindex] = sf[accindex];
    }
}

There are probably still some errors in the indices I used… I hope you get the idea.

That is not actually serious code, just something I used to test performance with and without memory coalescing. Because there are few examples explaining this concept, I had to design my own. By the way, do you have any examples (more sophisticated ones) that illustrate the insights of memory coalescing? Mine are too trivial.

Anyway, now I am clear about this concept.

Thanks a lot.