Structures of Arrays vs Arrays of Structures?

Heard this bandied about a lot, but I don't entirely understand it.

Structures of arrays are preferred to arrays of structures because the data is arranged so that each thread (in a half-warp) can access data in the same area of memory at once, otherwise known as coalescing, since we are trying to avoid having threads scatter their accesses across different areas of memory at the same time.

For example, if you had a point struct with x, y and z.

AOS:

struct Point
{
	float x;
	float y;
	float z;
};

Point* points;

you would end up with this:

x y z x y z x y z

SOA:

struct Points
{
	float* x;
	float* y;
	float* z;
};

you would end up with this:

x x x y y y z z z

Can someone explain exactly how SOA is better than AOS? If there are thousands of points, wouldn't accessing each x, y and z in separate threads mean accessing completely different areas of memory at once (and not necessarily areas unique to this block)?

Finally, if this is the case, why are we being encouraged to use CUDA's float3, float4, etc., which, when you have lots of them, are just arrays of structures? I can understand that manually aligning them in memory, or using them in a coalesced fashion in shared memory, will get round this, but why not just have arrays of x, y and z in our own data structures to start with?

True, if you take any one thread in the SOA case, you'll only see random reads and writes. But the other threads do very similar "random" I/O, and because it all happens at the same time, the multiprocessor can combine the accesses and read them in one go.

Maybe it helps to think of the multiprocessor as a vector processing unit that is 16 floats wide (or 3, for the sake of the example…). So you have the data structure:

x x x y y y z z z

For example you want to calculate z = x + y. What happens is:

  • thread 1 wants a[0], thread 2 wants a[1], thread 3 wants a[2]. The MP reads floats 0, 1, 2 (that's one contiguous block, cool, we can do it fast).

  • every thread gets to its next instruction; we read a[3], a[4] and a[5]. Can be read in one block again.

  • etc.

By the way, AOS is better for sequential cores with caches. SOA is really bad here:

  • let's do point 1: read a[0] (the cache thinks we'll need something nearby next, maybe a[1] or a[2]? but no), then read a[3], then write a[6]. Lots of cache misses.

  • same with task 2.

  • etc.

So to respond to your question: you don't access completely different blocks at the same time. And it doesn't really matter what you do sequentially (from a multiprocessor point of view). Or not without optimizations I don't yet know about :)

For float4s… I wouldn't think it's a good idea to read an array of float4s if you have better alternatives, but it's still better than reading scattered floats when there's no correlation in what your threads do. (A typical example I remember is a kernel for processing text strings, each thread handling a different one.)

float, float2 and float4 (or equivalent structures aligned to 32, 64 and 128 bits) are all coalesced and can be accessed at near peak bandwidth. A float3, however, is only 12 bytes and is not naturally aligned to any of those sizes. Anything bigger than a float4 will cause lower performance.

The Programming Guide touches on this, you can look up “coalescing”.