AoS (Array of Structures) is faster because it compiles into fewer load instructions?
Why do fewer load instructions make it faster?
Is one load of large data more costly than one load of small data?
If not, is loading small data, say a 32-bit word per thread, an inefficient approach that fails to fully utilize the bandwidth (given that 128 bits can be read into registers in one instruction)?
I define a class whose members are two doubles. Can I use align(16) for the class?
When I read from global memory and compute sigma = function(mpdata[i]), where mpdata is an array of that class, will the program load the two members in one instruction?
If not, how can I make it do so?
By the way, how do I query the maximum allowable alignment size (e.g., 128 bits) as a device property?
struct __align__(8) my_AoS {
int data1;
int data2;
};
works, am I right? If yes, does this fall under advice 1 or advice 2 that you gave?
Is "perfectly" coalesced access the best possible? In other words, will "loading multiple adjacent data elements of a structure simultaneously" work even better?
A global load will generate a 32-byte or 128-byte read request when executed across a warp. The objective is to have all the bytes that are loaded be used by the threads that are participating in the load. That is perfect coalescing. It does not matter if each thread is loading an int (32 x 4 bytes = one 128-byte transaction) or if each thread is loading an int4 (32 x 16 bytes = four 128-byte transactions). All of the data that gets loaded gets used by a thread that is participating in that transaction. (On newer devices, the usage order can be scrambled across threads in the warp, as well.)
The align(8) is probably superfluous. In typical usage that structure will already be aligned on an 8-byte boundary, even when used in an array. Since that structure contains no union, it cannot belong to advice 1 that I gave. But you could use advice 2, like:
my_AoS my_data[N];
int2 my_packed_thread_data = *((int2 *)(my_data + threadIdx.x)); // trigger 8-byte load per thread
int my_thread_data1 = my_packed_thread_data.x; // register variables
int my_thread_data2 = my_packed_thread_data.y; // register variables
The 8-byte load will combine with the 8-byte loads in adjacent threads to coalesce perfectly across the warp.
You could also put the int2 type in the structure:
struct my_AoS {
int2 data; // and refer to the .x and .y components for data1 and data2
};
The structure you are showing now is 20 bytes natively. The align directive will cause it to start on a proper 16-byte boundary, so that you could load some of the elements with a vector-type load. There may be some utility in that, but I would strongly consider re-organizing an array of such structures into the preferred structure-of-arrays format instead. This statement doesn't make sense:
“compiles into two 128-bit load instructions instead of five 32-bit load instructions.”
The only thing being shown is a structure definition. We cannot determine how it will load until we see the actual instructions used to load it. It could compile into two 128-bit load instructions if you use something like the int4 load approach I've already discussed. But that still doesn't adequately explain the intent for a 20-byte structure (two 128-bit loads fetch 32 bytes total, so what is the intent?).
align() cannot turn "the traditional 4 128-byte loads" into one.
Nothing can do that. The question itself suggests a lack of understanding, which is why I suggest reviewing the optimization material I linked.
The only thing the align directive can do is help with vector-type loads as I’ve already indicated, and possibly reduce a warp load from two cachelines to one, in some cases, but that is simply moving a problem from one place to another in your code.
Trying to learn the basics of CUDA programming this way isn't very efficient. I'd suggest taking advantage of some of the educational resources, such as the webinars here: