By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.
For local memory, the compiler arranges underlying storage such that:
int a[some_constant];
int b = a[some_other_constant]; // this read operation on a will coalesce
For global memory, accesses that include threadIdx.x without any multiplicative factor will coalesce:
__global__ void kernel(int *a){
int b = a[threadIdx.x + any_expression_not_including_threadIdx.x]; // this read operation on a will coalesce
I’m assuming a 1D-in-x threadblock here, but there really isn’t any difference for other configurations.
Note that for any of the above items, the profiler may only report a maximum of 50% global load efficiency. This will depend on a number of factors, such as the GPU architecure, as well as if any of the above transactions cross cacheline boundaries or DRAM segment boundaries. But according to my definition:
By that, I mean adjacent threads in the warp will be reading adjacent locations in physical DRAM.
coalescence still occurs in all of the above cases, as all of the above cases will satisfy my statement.