Detect memory coalescing from SASS file

Robert_Crovella · January 6, 2023, 11:15pm

The method is the same as what I covered in your previous question.

1. Identify a SASS LD instruction of interest
2. Identify the register that contains the address to load from
3. Determine the contents of that register across the warp, using all operands and components that are used to assemble the quantity in that register, for each thread in the warp.
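
A minimal host-side sketch of the last step, assuming you have already worked out from the SASS how the address register is formed for each thread. It evaluates that address for all 32 lanes of a warp and counts how many memory sectors the warp touches. The names `sectorsTouched` and `loadEfficiency`, and the 32-byte sector size, are assumptions of this model and do not come from the SASS or from any tool.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>

// Model parameters (assumptions for this sketch).
constexpr int kWarpSize = 32;     // threads per warp
constexpr int kSectorBytes = 32;  // assumed memory transaction granularity

// Count the distinct sectors touched when every lane of a warp loads
// `accessBytes` bytes starting at the address addressOfLane(lane).
int sectorsTouched(const std::function<std::uint64_t(int)>& addressOfLane,
                   int accessBytes) {
    assert(accessBytes > 0);
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = addressOfLane(lane);
        const std::uint64_t last = first + accessBytes - 1;
        // An access may straddle a sector boundary, so record every sector it covers.
        for (std::uint64_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    return static_cast<int>(sectors.size());
}

// Requested bytes divided by fetched bytes.
// Assumes the lanes request distinct (non-overlapping) bytes.
double loadEfficiency(const std::function<std::uint64_t(int)>& addressOfLane,
                      int accessBytes) {
    const int sectors = sectorsTouched(addressOfLane, accessBytes);
    return static_cast<double>(kWarpSize * accessBytes) /
           static_cast<double>(sectors * kSectorBytes);
}
```

In this model, 4-byte loads at `0x1000 + 4 * lane` touch 4 sectors (fully coalesced), while 4-byte loads at `0x1000 + 12 * lane` touch 12 sectors, so only one third of the fetched bytes are used.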

This isn’t the easier way to do it. The easier way is to use the C++ source code. I’ve covered an example of that as well in your last question (in this respect, the methodology is very similar between bank conflicts and coalescing). If you’re going to delve into this in detail, you may want to learn the basics of coalescing. I cover it in this training series, section 4.

For the bodyForce kernel you have shown, none of the loads (or stores) are perfectly coalesced, i.e. none of them achieves 100% memory utilization efficiency. With a basic understanding of data storage patterns and warp-wide access patterns, a very brief perusal of the C++ source code is enough to uncover that.

The most evident reason for the lack of coalescing is the use of an Array of Structures (AoS) data storage pattern, with each thread accessing specific elements of the structure. This creates a pattern where adjacent threads are not accessing adjacent memory locations, due to the intervening structure elements and storage pattern. For a structure with elements .x, .y and .z, it looks like this:

structure storage:  x y z x y z x y z x y z x y z  ...
accessing .x:       |     |     |     |     |      ...

The gaps in the adjacency of the access pattern above (corresponding to .y and .z for that example) result in an uncoalesced access.
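
The gap can be made concrete with a small host-side sketch that computes the byte offset each thread would read for `.x`. The types `BodyAoS` and the helper names are hypothetical, chosen to match the three-element picture above (this is not the actual structure from the bodyForce kernel). A Structure of Arrays (SoA) layout, where all `.x` values are stored contiguously, is shown for contrast.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 3-element structure matching the x/y/z picture above.
struct BodyAoS {
    float x, y, z;
};
static_assert(sizeof(BodyAoS) == 3 * sizeof(float), "sketch assumes no padding");

// AoS: thread `lane` reads bodies[lane].x.
std::size_t aosOffsetOfX(int lane) {
    return static_cast<std::size_t>(lane) * sizeof(BodyAoS) + offsetof(BodyAoS, x);
}

// SoA: thread `lane` reads xs[lane] from a separate contiguous array of x values.
std::size_t soaOffsetOfX(int lane) {
    return static_cast<std::size_t>(lane) * sizeof(float);
}

// Fraction of the bytes between adjacent threads' accesses that the load uses.
double usefulFraction(std::size_t strideBytes) {
    assert(strideBytes >= sizeof(float));
    return static_cast<double>(sizeof(float)) / static_cast<double>(strideBytes);
}
```

With these definitions, adjacent threads are 12 bytes apart in the AoS case and 4 bytes apart in the SoA case, so the `.x` load uses only 4 of every 12 bytes in the AoS layout.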

When we are accessing all elements anyway per thread (eventually), a possible method to make better access patterns is to do a “vector load” per thread (or “vector store”), but this is difficult or impossible for 3-element vectors/structures. Clarification: you can do an “apparent” vector load or store at the C++ source code level, but that will not be translated by the compiler into a single instruction that loads the entire vector/struct, for the 3-element vector case.
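
A rough sketch of why the 3-element case is awkward, assuming the usual rule that a single global memory access moves 1, 2, 4, 8, or 16 bytes and must be naturally aligned to that size. The types `Vec3` and `Vec4Padded` and the helper `singleAccessPossible` are hypothetical and only check sizes and alignment on the host. The padded 4-element type is included purely as a contrast and is not something proposed above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 3-element vector: 12 bytes, 4-byte aligned.
struct Vec3 {
    float x, y, z;
};

// Hypothetical padded variant: 16 bytes, 16-byte aligned; `w` is unused padding.
struct alignas(16) Vec4Padded {
    float x, y, z, w;
};

static_assert(sizeof(Vec3) == 12, "sketch assumes 4-byte float, no padding");
static_assert(sizeof(Vec4Padded) == 16 && alignof(Vec4Padded) == 16,
              "padded type should be 16 bytes and 16-byte aligned");

// True if an object of this size and alignment could be moved by one
// naturally aligned access of 1, 2, 4, 8, or 16 bytes (assumed rule).
bool singleAccessPossible(std::size_t size, std::size_t alignment) {
    const bool supportedWidth =
        size == 1 || size == 2 || size == 4 || size == 8 || size == 16;
    return supportedWidth && alignment % size == 0;
}

template <typename T>
bool loadableInOneAccess() {
    return singleAccessPossible(sizeof(T), alignof(T));
}
```

Under that rule, `loadableInOneAccess<Vec3>()` is false because 12 bytes is not a supported width, while `loadableInOneAccess<Vec4Padded>()` is true. A 16-byte type with only 4-byte alignment also fails the check.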
