First, you can make the read and initialization coalesced without changing your data structure… just change your view of it. You’re not processing or reordering the data when you put it into shared memory, so don’t think of it as reading and writing 32 contiguous 128-bit structs; think of it as reading and writing 128 contiguous 32-bit words. It’s the same data, but the latter view makes it obvious that it’s just a 1:1 copy, and you can explicitly copy that way, perhaps with lines of code like:
((int *)sfield)[threadIdx.x     ] = ((int *)field)[threadIdx.x     ];
((int *)sfield)[threadIdx.x + 32] = ((int *)field)[threadIdx.x + 32];
((int *)sfield)[threadIdx.x + 64] = ((int *)field)[threadIdx.x + 64];
((int *)sfield)[threadIdx.x + 96] = ((int *)field)[threadIdx.x + 96];
This is for a toy example with a single warp of 32 threads… in general you’d have all of the block’s threads cooperating in the copy with a for() loop, followed by a __syncthreads(), as in the sketch below.
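Here’s a minimal sketch of that generalized copy, assuming the field is FIELD_WORDS 32-bit words long (FIELD_WORDS, field, and sfield are placeholder names, not anything from your actual code):

// Generalized block-wide copy: each iteration, the whole block moves
// blockDim.x consecutive 32-bit words, so every global read is coalesced
// and every shared write is conflict-free.
#define FIELD_WORDS 4096              // total 32-bit words to stage (assumed size)

__global__ void kernel(const int *field)
{
    __shared__ int sfield[FIELD_WORDS];

    for (int i = threadIdx.x; i < FIELD_WORDS; i += blockDim.x)
        sfield[i] = field[i];

    __syncthreads();                  // copy must finish before any warp reads sfield

    // ... rest of the kernel works out of sfield ...
}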
This strategy is great because global memory reads are all coalesced, and shared memory accesses have no bank conflicts: consecutive 32-bit words fall into consecutive banks, so every thread of the warp hits a different bank.
Even if you WERE rearranging the order, or the above flat copy weren’t appropriate, there’s often no significant penalty in just lazily copying the data exactly as your code shows. Don’t get scared by “Oh no! 25% memory efficiency!”. That used to be a real cost before Fermi, but with Fermi and Kepler you have multiple fast data caches which hide a lot of those minor data rearrangements. Hitting L1 cache is FAST compared to a global memory read, so sometimes it’s fine to be lazy and let the cache handle your mismatches… that saves your programming effort and lines of code for more important things. This advice of course comes with the standard caveat that it varies with your application… sometimes the memory copies really are a bottleneck and you SHOULD manually optimize them. And if you’re still programming for SM_13 or earlier, manual coalescing is necessary.
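For concreteness, here’s roughly what the lazy version looks like: each thread copies one whole 128-bit struct (Element4 is a made-up stand-in for your actual type). Without an alignment attribute this typically compiles to four strided 32-bit loads per thread, which is where the 25% number comes from… but the words each load instruction skips over are exactly the ones the following loads want, so on Fermi/Kepler they’re usually still sitting in cache:

// "Lazy" copy for comparison: one 128-bit struct per thread.
// Element4 is a hypothetical stand-in for the original data type.
struct Element4 { int a, b, c, d; };        // 4 x 32 bits = 128 bits

__global__ void lazyCopy(const Element4 *field)
{
    __shared__ Element4 sfield[32];         // one warp's worth, as in the toy example

    // Typically compiles to four 32-bit loads per thread with a 16-byte
    // stride: only 25% of each fetched cache line is used by a given load
    // instruction, but the leftover words are cache hits for the next ones.
    sfield[threadIdx.x] = field[threadIdx.x];
    __syncthreads();

    // ... use sfield ...
}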