Coalescence

Dear all

Two cases:

__global__ void kernel(xxxxxx)
{
    float a[1024]; // goes to local memory (which resides in device DRAM)

    // OFFSET in 0…1023

    float b = a[OFFSET];

    // some non-coalesced accesses (not all)
}

int main()
{
    kernel<<<1,32>>>(xxxxxxx);
}


Second case

__global__ void kernel(float *a, xxxxxx)
{
    int i = threadIdx.x;
    float b;

    b = a[threadIdx.x + 32*OFFSET];

    // accesses to "a" in multiples of 32; some coalescence exists
}

int main()
{
    float *a;

    cudaMalloc(&a, 1024*32*sizeof(float)); // may not be the exact syntax; the meaning is what matters

    kernel<<<1,32>>>(a, xxxxxxx);
}


In the first case each thread accesses its own array individually; in the second case the 32 threads share a single array.

Now the question:

The first case is faster than the second case. Why? Does CUDA automatically make the accesses coalesced, as in the second case?

Can anyone give an explanation?

Thanks

Luis Gonçalves

There shouldn’t be much difference between the two cases.
Both will generate coalesced access.

The compiler arranges local storage such that access to a particular element in the array will generate coalesced access across a warp.

In the second case, of course, the coalesced access is obvious.

The compiler may have different strategies for optimization-into-registers for global vs. local access, or there may be other differences.
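
To make the comparison concrete, here is a minimal sketch (my own, not from the original post) of what the two cases might look like written out fully; the kernel names, the loop over k in place of a single OFFSET, and the output array are assumptions for illustration:

#include <cstdio>

#define N    1024   // elements per per-thread array
#define WARP 32     // one warp, matching the <<<1,32>>> launch above

// Case 1: per-thread local array. The compiler interleaves local storage
// so that a[k], for a fixed k, is read from adjacent DRAM locations by
// adjacent threads (it may also promote the array to registers entirely).
__global__ void kernel_local(float *out)
{
    float a[N];
    for (int k = 0; k < N; ++k) a[k] = (float)k;
    float b = 0.0f;
    for (int k = 0; k < N; ++k) b += a[k];   // same index k in every thread
    out[threadIdx.x] = b;
}

// Case 2: one global array of size N*WARP. Thread t reads a[t + 32*k],
// so adjacent threads read adjacent addresses: also coalesced.
__global__ void kernel_global(const float *a, float *out)
{
    float b = 0.0f;
    for (int k = 0; k < N; ++k) b += a[threadIdx.x + WARP*k];
    out[threadIdx.x] = b;
}

int main()
{
    float *a, *out;
    cudaMalloc(&a,   N * WARP * sizeof(float));
    cudaMalloc(&out, WARP * sizeof(float));
    cudaMemset(a, 0, N * WARP * sizeof(float));

    kernel_local<<<1, WARP>>>(out);
    kernel_global<<<1, WARP>>>(a, out);
    cudaDeviceSynchronize();

    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

Timing the two kernels (for example with cudaEvent timers or the profiler) is the way to investigate the speed difference being asked about; any gap is more likely due to register promotion of the local array than to a difference in coalescing.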

Sir

But suppose that in the first case the accesses are first a[1], then a[1000], and then a[500], in each of the 32 individual (per-thread) arrays.

In the second case it is:

a[1*32+threadIdx.x], a[1000*32+threadIdx.x], and a[500*32+threadIdx.x], with threadIdx.x in [0..31].

Do you mean that the first pattern is still handled as coalesced? In what way?

Thanks

Luis Gonçalves

Sir

All of those will coalesce.

By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.

For local memory, the compiler arranges underlying storage such that:

int a[some_constant];

int b = a[some_other_constant]; // this read operation on a will coalesce

For global memory, accesses that include threadIdx.x without any multiplicative factor will coalesce:

__global__ void kernel(int *a){

int b = a[threadIdx.x + any_expression_not_including_threadIdx.x]; // this read operation on a will coalesce

}

I’m assuming a 1D-in-x threadblock here, but there really isn’t any difference for other configurations.
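
For contrast, here is a small sketch of my own (the array, the STRIDE constant, and the fixed offset of 100 are hypothetical) showing an index expression that follows that rule next to one that does not:

__global__ void access_patterns(const float *a, float *out)
{
    const int STRIDE = 32;

    // Coalesced: threadIdx.x appears with no multiplicative factor,
    // so adjacent threads read adjacent elements.
    float x = a[threadIdx.x + 100];

    // Not coalesced: threadIdx.x is multiplied by STRIDE, so adjacent
    // threads read elements 32 apart and the warp touches many
    // separate DRAM segments.
    float y = a[threadIdx.x * STRIDE];

    out[threadIdx.x] = x + y;  // keep the loads from being optimized away
}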

Note that for any of the above items, the profiler may report no more than 50% global load efficiency. This will depend on a number of factors, such as the GPU architecture, as well as on whether any of the above transactions cross cacheline boundaries or DRAM segment boundaries. But according to my definition:

By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.

coalescence still occurs in all of the above cases, as all of the above cases will satisfy my statement.
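
As one illustration of the boundary-crossing point (my own example, not from the posts above): a warp reading adjacent floats starting one element past a 128-byte boundary is still coalesced by the definition given here, yet its warp-wide request straddles two cachelines, so the reported load efficiency drops.

__global__ void misaligned_but_coalesced(const float *a, float *out)
{
    // Adjacent threads read adjacent floats, so this coalesces, but the
    // +1 element offset shifts the warp's 128-byte request off a
    // 128-byte cacheline boundary; the profiler may therefore report
    // less than 100% global load efficiency for this load.
    out[threadIdx.x] = a[threadIdx.x + 1];
}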