Why AoS faster in accessing global memory?

AoS (Array of Structure) is faster because it compiles into fewer load instructions?

Why fewer load instructions make it faster?

Does one load of large data more costly than one load of small data?

If not, is loading small data, say 32-bit word for one thread an inefficient way that fails to sufficiently utilize the bandwidth (if 128-bit work can be read into registers in one instruction)?

I define a class, whose members are two doubles. Can I use align(16) for the class?

When I read from global memory and compute sigma = function(mpdata[i]), where mpdata is array of the class, will the program load the two members in one instruction?

If not, how can I set it to do?

By the way, how to get the maximum allowable align size like 128-bit as device property?


The traditional advice is that SoA:

struct my_SoA {

int data1[N];
int data2[N];

my_SoA my_data;

is faster than AoS:

struct my_AoS {
int data1;
int data2;

my_AoS my_data[N];

because the SoA method makes it easier to produce coalesced access in kernel code:

my_SoA my_data1;

int my_thread_data1 = my_data1.data1[threadIdx.x]; // this load will coalesce perfectly

my_AoS my_data2[N];

int my_thread_data2 = my_data2[threadIdx.x].my_data2; //this will not coalesce perfectly

If you want to load multiple adjacent data elements in a structure simultaneously, you can either:

  1. Provide a union in the structure to combine the two, and load that, then unpack the union in your kernel code.

  2. Cast the array-of-struct pointer to an equivalent vector type that loads all the data you want, and then unpack that vector type.

Both methods require appropriate attention to data alignment and packing rules.

These methods will work up to 16bytes per thread, which is the maximum that can be loaded per thread in a single load (instruction).

I learnt from some a lecture that

struct __align__(8) my_AoS {
int data1;
int data2;

works, am I right? If yes, does this belong to advice 1 or 2 you gave?

Is the “perfectly” coalesced access the best one? In other words, will “load multiple adjacent data elements in a structure simultaneously” work better?

A global load will generate a 32-byte or 128-byte read request when executed across a warp. The objective is to have all the bytes that are loaded be used by the threads that are participating in the load. That is perfect coalescing. It does not matter if each thread is loading an int (32 x 4 bytes = one 128 byte transaction) or if each thread is loading a int4 (32 x 16 bytes = four 128 byte transactions). All of the data that gets loaded gets used by a thread that is participating in that transaction. (on newer devices, the usage order can be scrambled across threads in the warp, as well)

The align(8) is probably superfluous. In typical usage that structure will already be aligned on a 8 byte boundary, even when used in an array. Since that structure contains no union, it could not possibly belong to advice 1 that I gave. But you could use advice 2, like:

my_AoS my_data[N];

int2 my_packed_thread_data = *((int2 *)(my_data + threadIdx.x)); // trigger 8-byte load per thread
int my_thread_data1 = my_packed_thread_data.x; // register variables
int my_thread_data2 = my_packed_thread_data.y; // register variables

the 8-byte load will combine with other 8-byte loads in adjacent threads to perfectly coalesce across the warp

You could also put the int2 type in the structure:

struct my_AoS {
int2 data; // and refer to the .x and .y components for data1 and data2

A union approach would be something like:

struct inter {
int data1;
int data2;

struct my_AoS {
inter idata;
int2 vdata;
} x;

Then you could reference

int temp = my_data.x.idata.data1;

int temp = my_data.x.vdata.x;

And if you want to do a 8-byte load:

int2 temp = my_data.x.vdata;

I almost never use unions, so hopefully I haven’t made any syntax errors above. Hopefully you get the idea.

Your explanation is clear. “load multiple adjacent data elements in a structure simultaneously” just works in the same degree of ‘perfect’ coalesce.

However, what I read from another lecture slide seems to describe differently.

Does that mean align() can make the traditional 4 128-Byte loading into one?

EDIT: I can share the pdf. How to upload attachment on this forum?

Perhaps you should review some material on optimizations for memory access:


The structure you are showing now has a total of 20 bytes natively. So the align directive will cause that structure to start on a proper 16-byte boundary, so that you could load some of the elements with a vector-type load. There may be some utility in that, but I would strongly consider re-organizing an array of such structures into the preferred structure-of-arrays format instead. This statement doesn’t make sense:

“compiles into two 128-bit load instructions instead of five 32-bit load instructions.”

The only thing being shown is a structure definition. We cannot determine how it will load until we see the actual instructions used to load it. It could compile into two 128-bit load instructions if you use something like the int4 load approach I’ve already discussed. But that still doesn’t adequately explain the intent for a 20-byte structure (two 128-bit loads is 32 bytes total. So what is the intent?)

align() cannot make “the traditional 4 128-Byte loading into one”

Nothing can do that. The question itself suggests a lack of understanding, which is why I suggest reviewing the optimization material I linked.

The only thing the align directive can do is help with vector-type loads as I’ve already indicated, and possibly reduce a warp load from two cachelines to one, in some cases, but that is simply moving a problem from one place to another in your code.

Trying to learn basics of CUDA programming this way isn’t very efficient. I’d suggest taking advantage of some of the educational resources such as the webinars here:


Thanks for sharing the material. I’ve been with such issue for long. Hope the resources will help me out.