Misaligned starting address for memory coalescing

One rule for memory coalescing (compute capability 1.0 and 1.1) is

“The address of the first element is aligned to 16 times the element’s size”

The problem is how to detect a misaligned starting address for memory coalescing, since the address itself never appears explicitly in the code.

For example, consider this code [credit to kaberdude]:

__device__ int forward[32] = { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                              11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void coalesce(int *data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sf[32];
    sf[idx] = forward[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sf[idx]] += n;
}

How can we know whether “the address of the first element is aligned to a multiple of 64 bytes” (16 elements × 4 bytes per int)?

Many thanks.

Jinpeng

Under your assumption… I don’t know what you are doing with that array called sf, and that __syncthreads() seems unnecessary too.

Anyway, global memory pointers allocated with cudaMalloc() are guaranteed to be at least 128-byte aligned, so there is no problem in your code.
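If you want to see this for yourself, here is a minimal host-side sketch (the buffer size is made up; the 128-byte figure is the guarantee mentioned above):

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

int main()
{
    int *data = 0;
    cudaMalloc((void**)&data, 32 * sizeof(int));

    // The numeric value of the pointer tells you its alignment:
    // a remainder of 0 means the first element starts on a
    // 64-byte (and here 128-byte) boundary.
    uintptr_t addr = (uintptr_t)data;
    printf("data %% 64  = %llu\n", (unsigned long long)(addr % 64));
    printf("data %% 128 = %llu\n", (unsigned long long)(addr % 128));

    cudaFree(data);
    return 0;
}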

But I think you have not given the complete code. Why don’t you just do data[threadIdx.x] = n * iter;? If the way you access data is not always with the index threadIdx.x, then remember that within the same thread (possibly inside the loop?) the index has to advance in multiples of 32, so that after the offset the warp’s accesses still start on a 128-byte boundary.
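To make that concrete, here is a minimal sketch of the stride-of-32 pattern (the kernel name is made up for illustration):

// Each of the 32 threads writes `iter` elements. On iteration i the warp
// touches data[i*32 + 0] .. data[i*32 + 31]: one contiguous, 128-byte-aligned
// segment, i.e. a single coalesced transaction per iteration.
__global__ void coalesced_writes(int *data, int n, int iter)
{
    int idx = threadIdx.x; // one block of 32 threads assumed
    for (int i = 0; i < iter; ++i)
        data[idx + i * 32] = n;
}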

sf is used to store indices. With sf, data[sf[idx]] means thread “idx” will access the sf[idx]-th element in global memory (because data is allocated in global memory). If I understand correctly, data[threadIdx.x+1] will generate a misaligned memory access, right?

Thanks

Deryk

Yes, data[threadIdx.x+1] will generate a misaligned access. But what is the point of doing this?
If you must use the thread with threadIdx.x = x0 to access data[x0+1], then you could consider using shared memory to reorder your accesses. Otherwise just write with an increment of 32 (or a minimum of 16 for CC 1.x), like this: data[threadIdx.x + i*32]

For using shared memory to reorder accesses, here’s what it looks like:

__global__ void coalesce(int *data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;

    // I’ll just use sf for my own purpose here… I still don’t understand
    // why you use it to store the index. Why not use idx directly?
    // Also, the size should not be 32: it should be 31+iter. You can set
    // this in the launch configuration (dynamic shared memory).
    extern __shared__ int sf[];

    // I’m assuming you’ve already loaded the values from data into sf.
    for (int i = 0; i < iter; ++i)
        sf[idx + i] += n; // guaranteed to have no bank conflicts, and
                          // misalignment has no effect on shared memory

    int total = 31 + iter;       // total number of words to be written
    int max = (total + 31) / 32; // number of 128-byte memory transactions to
                                 // global memory, i.e. ceil(total/32), since
                                 // integer division truncates
    for (int i = 0; i < max; ++i)
    {
        int accindex = i * 32 + idx; // access global memory with an
                                     // increment of 32 words
        if (accindex < total)
            data[accindex] = sf[accindex];
    }
}

There are probably some errors in the indices I used… hope you can get the idea.
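One thing the code above relies on but does not show is the launch configuration: with extern __shared__, the size of sf is passed as the third launch parameter. A hypothetical launch (the iter and n values are made up) might look like:

int iter = 8;
int n = 1;
int *data;
cudaMalloc((void**)&data, (31 + iter) * sizeof(int));

size_t smem = (31 + iter) * sizeof(int); // dynamic shared memory for sf
coalesce<<<1, 32, smem>>>(data, n, iter); // one block of 32 threads
cudaDeviceSynchronize();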

That is actually not serious code, just something I used to test performance with and without memory coalescing. Because there are few examples that explain this concept, I had to design my own. By the way, do you have any more sophisticated examples that illustrate the insights of memory coalescing? Mine is too trivial.

Anyway, now I am clear about this concept.

Thanks a lot.

Deryk