Coalescence

Dear all

Two cases:

__global__ void kernel(xxxxxx)
{
    float a[1024]; // goes to local memory (which resides in device DRAM)

    // OFFSET in 0…1023

    float b = a[OFFSET];

    // some non-coalesced accesses (not all)
}

int main()
{
    kernel<<<1,32>>>(xxxxxxx);
}


Second case

__global__ void kernel(float *a, xxxxxx)
{
    int i = threadIdx.x;
    float b;

    b = a[threadIdx.x + 32*OFFSET];

    // accesses to "a" in multiples of 32; some coalescence exists
}

int main()
{
    float *a;

    cudaMalloc(&a, 1024*32*sizeof(float)); // may not be the exact syntax; the meaning is what matters

    kernel<<<1,32>>>(a, xxxxxxx);
}


In the first case each thread accesses its own array individually; in the second case the 32 threads share a single array.

Now the question:

The first case is faster than the second case. Why? Does CUDA automatically make the accesses coalesced, as in the second case?

Can anyone give an explanation?

Thanks

Luis Gonçalves

There shouldn’t be much difference between the two cases.
Both will generate coalesced access.

The compiler arranges local storage such that access to a particular element in the array will generate coalesced access across a warp.

In the second case, of course, the coalesced access is obvious.

The compiler may have different strategies for optimization-into-registers for global vs. local access, or there may be other differences.
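
To make the comparison concrete, here is a minimal sketch (my own, not from the original post) of what the two cases might look like written out fully; the kernel names, the loop over k in place of a single OFFSET, and the output array are assumptions for illustration:

#include <cstdio>

#define N    1024   // elements per per-thread array
#define WARP 32     // one warp, matching the <<<1,32>>> launch above

// Case 1: per-thread local array. The compiler interleaves local storage
// so that a[k], for a fixed k, is read from adjacent DRAM locations by
// adjacent threads (it may also promote the array to registers entirely).
__global__ void kernel_local(float *out)
{
    float a[N];
    for (int k = 0; k < N; ++k) a[k] = (float)k;
    float b = 0.0f;
    for (int k = 0; k < N; ++k) b += a[k];   // same index k in every thread
    out[threadIdx.x] = b;
}

// Case 2: one global array of size N*WARP. Thread t reads a[t + 32*k],
// so adjacent threads read adjacent addresses: also coalesced.
__global__ void kernel_global(const float *a, float *out)
{
    float b = 0.0f;
    for (int k = 0; k < N; ++k) b += a[threadIdx.x + WARP*k];
    out[threadIdx.x] = b;
}

int main()
{
    float *a, *out;
    cudaMalloc(&a,   N * WARP * sizeof(float));
    cudaMalloc(&out, WARP * sizeof(float));
    cudaMemset(a, 0, N * WARP * sizeof(float));

    kernel_local<<<1, WARP>>>(out);
    kernel_global<<<1, WARP>>>(a, out);
    cudaDeviceSynchronize();

    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

Timing the two kernels (for example with cudaEvent timers or the profiler) is the way to investigate the speed difference being asked about; any gap is more likely due to register promotion of the local array than to a difference in coalescing.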

Sir

But suppose that in the first case the accesses are first a[1], then a[1000], and then a[500], in each of the 32 individual (per-thread) arrays.

In the second case it is:

a[1*32+threadIdx.x], a[1000*32+threadIdx.x], and a[500*32+threadIdx.x], with threadIdx.x in [0..31].

Do you mean that the first pattern is still handled as coalesced? In what way?

Thanks

Luis Gonçalves

Sir

All of those will coalesce.

By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.

For local memory, the compiler arranges underlying storage such that:

int a[some_constant];

int b = a[some_other_constant]; // this read operation on a will coalesce

For global memory, accesses that include threadIdx.x without any multiplicative factor will coalesce:

__global__ void kernel(int *a){

int b = a[threadIdx.x + any_expression_not_including_threadIdx.x]; // this read operation on a will coalesce

}

I’m assuming a 1D-in-x threadblock here, but there really isn’t any difference for other configurations.
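
For contrast, here is a small sketch of my own (the array, the STRIDE constant, and the fixed offset of 100 are hypothetical) showing an index expression that follows that rule next to one that does not:

__global__ void access_patterns(const float *a, float *out)
{
    const int STRIDE = 32;

    // Coalesced: threadIdx.x appears with no multiplicative factor,
    // so adjacent threads read adjacent elements.
    float x = a[threadIdx.x + 100];

    // Not coalesced: threadIdx.x is multiplied by STRIDE, so adjacent
    // threads read elements 32 apart and the warp touches many
    // separate DRAM segments.
    float y = a[threadIdx.x * STRIDE];

    out[threadIdx.x] = x + y;  // keep the loads from being optimized away
}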

Note that for any of the above items, the profiler may report no more than 50% global load efficiency. This will depend on a number of factors, such as the GPU architecture, as well as on whether any of the above transactions cross cacheline boundaries or DRAM segment boundaries. But according to my definition:

By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.

coalescence still occurs in all of the above cases, as all of the above cases will satisfy my statement.
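
As one illustration of the boundary-crossing point (my own example, not from the posts above): a warp reading adjacent floats starting one element past a 128-byte boundary is still coalesced by the definition given here, yet its warp-wide request straddles two cachelines, so the reported load efficiency drops.

__global__ void misaligned_but_coalesced(const float *a, float *out)
{
    // Adjacent threads read adjacent floats, so this coalesces, but the
    // +1 element offset shifts the warp's 128-byte request off a
    // 128-byte cacheline boundary; the profiler may therefore report
    // less than 100% global load efficiency for this load.
    out[threadIdx.x] = a[threadIdx.x + 1];
}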