coalesced access to global memory

Dalibor_CZ · January 14, 2013, 2:40pm

Hallo,
in my program I have a class for complex numbers [in code section]. Is writing to and reading from the field of dcomplex objects in Kernel coalesced?

Simple kernel example with ?coalesced? reading from the field:

class dcomplex {
double r;
double i;
}

void Kernel(dcomplex * field){
__shared__ dcomplex sfield [32];

sfield[threadIdx.x]=field[threadIdx.x];
}

jgonzac · January 14, 2013, 4:00pm

In this case, each WARP has to read 512 bytes from global memory, so, it needs at least four transactions of 128 bytes (compute capability >= 2.0) in a coalesced way. So, you will have 25% global memory load efficiency due to your struct (16 bytes).
You can try to use two arrays, one for each component of your struct to improve the efficiency.

SPWorley · January 14, 2013, 5:13pm

Two points:

First, you can make the read and initialization coalesced without changing your data structure… just change your view of it. You’re not processing or reordering it when you put it into shared memory, so don’t think of it as reading and writing 32 contiguous 128 bit structs, think of it as reading and writing 128 contiguous 32 bit words. It’s the same thing, but the latter view makes it obvious that it’s just a 1:1 data copy and you can explicitly copy that way, perhaps with lines of code like:

((int *)sfield)[threadIdx.x   ]=((int *)field)[threadIdx.x   ];
((int *)sfield)[threadIdx.x+32]=((int *)field)[threadIdx.x+32];
((int *)sfield)[threadIdx.x+64]=((int *)field)[threadIdx.x+64];
((int *)sfield)[threadIdx.x+96]=((int *)field)[threadIdx.x+96];

This is for a toy example with only 32 threads in the warp… in general you’d have all warps copying in a for() loop with a __syncthreads() at the end.
This strategy is great because global memory reads are all coalesced, and shared memory accesses have no bank conflicts.

Second point:
Even if you WERE rearranging the order or the above flat copy wasn’t appropriate, it’s often no significant penalty to just lazily copy the data just as your code shows. Don’t get scared by “Oh no! 25% memory efficiency!”. That used to be the case before Fermi, but with Fermi and Kepler you have multiple fast data caches which really hide a lot of those minor data rearrangements. Hitting L1 cache is FAST compared to a global memory read, so sometimes it’s fine to be lazy and let the cache handle your mismatches… that saves your programming effort and lines of code for more important things. This advice of course has the standard caveat that it varies with your application… sometimes the memory copies are indeed a bottleneck and you SHOULD manually optimize them. If you’re still programming SM_13 or earlier then it is necessary.

Dalibor_CZ · January 15, 2013, 7:35am

Thanks for both replies,
now I have a better idea about the coalesced access to global memory. Last two stupid questions:

1 Does coalesced writing to global memory exist? - as a inverse process to coalesced global memory reading.

Is access to global memory coalesced in cases when the kernel is launched with less than 32 threads? Lets say for example 12?

jgonzac · January 15, 2013, 12:00pm

Of course it does. It works in the same way if all threads write to different addresses, if not, there could be memory conflicts.
Yes, it is, but you only have twelve 32-bit useful words per memory transaction.
Anyway, if you are launching a kernel with only 12 threads, you are not exploiting the capabilities of your graphic card.

jgonzac · January 15, 2013, 12:03pm

SPWorley:

Two points:

First, you can make the read and initialization coalesced without changing your data structure… just change your view of it. You’re not processing or reordering it when you put it into shared memory, so don’t think of it as reading and writing 32 contiguous 128 bit structs, think of it as reading and writing 128 contiguous 32 bit words. It’s the same thing, but the latter view makes it obvious that it’s just a 1:1 data copy and you can explicitly copy that way, perhaps with lines of code like:
((int *)sfield)[threadIdx.x   ]=((int *)field)[threadIdx.x   ];
((int *)sfield)[threadIdx.x+32]=((int *)field)[threadIdx.x+32];
((int *)sfield)[threadIdx.x+64]=((int *)field)[threadIdx.x+64];
((int *)sfield)[threadIdx.x+96]=((int *)field)[threadIdx.x+96];
This is for a toy example with only 32 threads in the warp… in general you’d have all warps copying in a for() loop with a __syncthreads() at the end.
This strategy is great because global memory reads are all coalesced, and shared memory accesses have no bank conflicts.

Good idea, I still have much to learn. Thanks!

ggeo · May 8, 2014, 9:45am

((int *)sfield)[threadIdx.x ]=((int *)field)[threadIdx.x ];
    ((int *)sfield)[threadIdx.x+32]=((int *)field)[threadIdx.x+32];
    ((int *)sfield)[threadIdx.x+64]=((int *)field)[threadIdx.x+64];
    ((int *)sfield)[threadIdx.x+96]=((int *)field)[threadIdx.x+96];

Nice,thanks!

A question though.

If instead of threadIdx.x for the field variable I had

ind = ( threadIdx.y + blockIdx.y * blockDim.y ) * NumberOfCols + ( threadIdx.x + blockIdx.x * blockDim.x )

Would still be applied the above code?

((int *)sfield)[threadIdx.x ]=((int *)field)[ind ];
((int *)sfield)[threadIdx.x+32]=((int *)field)[ind +32];

I think no!Is that right?

Because ind will take all it’s values concurrently?

(If you can help me with Coalesced shared memory access? Read and write from which thread to which? - CUDA Programming and Performance - NVIDIA Developer Forums
I 'll appreciate)

Thanks!

PS: I noticed that this is an old thread!