The method is the same as what I covered in your previous question.
- Identify a SASS LD instruction of interest
- Identify the register that contains the address to load from
- Determine the contents of that register across the warp, using all operands and components that are used to assemble the quantity in that register, for each thread in the warp.
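The third step can be modeled on the host. This is a minimal sketch, not a tool that reads SASS: it assumes the address register has the common form base + global thread index * stride + offset, and the names `warpAddresses` and `isContiguous` are hypothetical helpers made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A warp is 32 lanes with consecutive threadIdx.x values.
constexpr unsigned kWarpSize = 32;

// Evaluate the address register of a load for every lane of one warp.
// ASSUMPTION: address = base + (blockIdx.x * blockDim.x + threadIdx.x) * strideBytes + offsetBytes.
std::vector<std::uint64_t> warpAddresses(std::uint64_t base,
                                         unsigned blockIdxX,
                                         unsigned blockDimX,
                                         unsigned firstThreadIdxX,
                                         std::uint64_t strideBytes,
                                         std::uint64_t offsetBytes) {
    // The first lane of a warp always has a threadIdx.x that is a multiple of 32.
    assert(firstThreadIdxX % kWarpSize == 0);
    std::vector<std::uint64_t> addrs;
    addrs.reserve(kWarpSize);
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        std::uint64_t gid =
            std::uint64_t(blockIdxX) * blockDimX + firstThreadIdxX + lane;
        addrs.push_back(base + gid * strideBytes + offsetBytes);
    }
    return addrs;
}

// True when each lane reads the element immediately after the previous lane's.
bool isContiguous(const std::vector<std::uint64_t>& addrs,
                  std::uint64_t accessBytes) {
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        if (addrs[i] != addrs[i - 1] + accessBytes) return false;
    }
    return true;
}
```

With a 4-byte access, a 4-byte stride gives contiguous addresses across the warp, while a 12-byte stride (a struct of three floats) does not. Contiguity in lane order is a sufficient condition for a fully coalesced access, not a necessary one.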
This isn’t the easier way to do it. The easier way is to use the C++ source code. I’ve covered an example of that as well in your last question (in this respect, the methodology is very similar between bank conflicts and coalescing). If you’re going to delve into this in detail, you may want to learn the basics of coalescing. I cover it in this training series, section 4.
For the bodyForce kernel you have shown, none of the loads (or stores) are perfectly coalesced, i.e. none of them reaches 100% memory utilization efficiency. A basic understanding of data storage patterns and warp-wide access patterns will immediately uncover that with a very brief perusal of the C++ source code.
The most evident reason for the lack of coalescing is the use of the Array of Structures (AoS) data storage pattern, with each thread accessing specific elements of the structure. This creates a pattern where adjacent threads are not accessing adjacent memory locations, due to the intervening structure elements. For a structure with elements .x, .y and .z, it looks like this:
structure storage: x y z x y z x y z x y z x y z ...
accessing .x:      |     |     |     |     |     ...
The gaps in the adjacency of the access pattern above (corresponding to .y and .z for that example) result in an uncoalesced access.
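To put a number on the cost of those gaps, here is a rough host-side sketch that counts how many memory sectors one warp touches. It assumes a 32-byte sector granularity, a sector-aligned base address, and a 4-byte float. `loadEfficiency` is a hypothetical helper, not a profiler metric.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr unsigned kLanes = 32;             // threads per warp
constexpr std::uint64_t kSectorBytes = 32;  // ASSUMPTION: memory is fetched in 32-byte sectors

// Fraction of the fetched bytes that the warp actually asked for, when lane i
// reads accessBytes starting at byte offset i * strideBytes.
double loadEfficiency(std::uint64_t strideBytes, std::uint64_t accessBytes) {
    assert(accessBytes > 0 && accessBytes <= kSectorBytes);
    std::set<std::uint64_t> sectors;  // distinct sectors touched by the warp
    for (unsigned lane = 0; lane < kLanes; ++lane) {
        std::uint64_t first = lane * strideBytes;
        std::uint64_t last = first + accessBytes - 1;
        for (std::uint64_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    double requested = double(kLanes) * double(accessBytes);
    double fetched = double(sectors.size()) * double(kSectorBytes);
    return requested / fetched;
}
```

Under these assumptions, `loadEfficiency(4, 4)` (adjacent floats, as in a Structure of Arrays layout) touches 4 sectors and returns 1.0. `loadEfficiency(12, 4)` (the .x member of a 12-byte struct) touches 12 sectors for the same 128 requested bytes, so only about a third of the fetched bytes are used.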
When we are accessing all elements anyway per thread (eventually), a possible method to make better access patterns is to do a “vector load” per thread (or “vector store”), but this is difficult or impossible for 3-element vectors/structures. Clarification: you can do an “apparent” vector load or store at the C++ source code level, but that will not be translated by the compiler into a single instruction that loads the entire vector/struct, for the 3-element vector case.
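One way to see why the 3-element case is awkward is to compare sizes and alignments. The sketch below assumes that a single load or store instruction moves 1, 2, 4, 8 or 16 bytes and needs a naturally aligned address. `Vec3`, `Vec4` and `fitsSingleAccess` are hypothetical names, and the padded 4-float struct is only there for contrast.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 3-float structure: 12 bytes, 4-byte aligned.
struct Vec3 { float x, y, z; };

// Hypothetical 4-float variant with explicit 16-byte alignment, for contrast.
struct alignas(16) Vec4 { float x, y, z, w; };

// ASSUMPTION: one memory instruction can move 1, 2, 4, 8 or 16 bytes, and the
// address must be aligned to the access size.
constexpr bool fitsSingleAccess(std::size_t size, std::size_t alignment) {
    return (size == 1 || size == 2 || size == 4 || size == 8 || size == 16) &&
           alignment % size == 0;
}

static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
static_assert(sizeof(Vec3) == 12, "three packed floats occupy 12 bytes");
static_assert(sizeof(Vec4) == 16, "four floats occupy 16 bytes");
```

Here `fitsSingleAccess(sizeof(Vec3), alignof(Vec3))` is false because 12 is not one of the supported widths, while the same check on `Vec4` is true. That is consistent with the 3-element load being split into several narrower instructions.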