and in the section ‘Custom Structures’ it says that
‘SoA (structure of arrays) is the preferable approach for many cases for data-parallel computations because it groups related data into a contiguous array.’
Can this be true? In SoA every member of the struct gets its own array, so the members of one struct are not stored contiguously. Is this a typo and should it say AoS?
This PDF talks about applications in which successive threads load successive structures from memory. In such cases, having SoA will result in coalesced accesses, because successive threads then read consecutive elements of the same field array…
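To make that concrete, here is a minimal sketch of the two layouts when each thread reads one field of its own record (the kernel and variable names are made up for illustration):

struct AoS { float x, y, z; };                    // one record per element, 12 bytes per record

__global__ void read_aos(const struct AoS *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid].x;     // neighbouring threads read addresses 12 bytes apart: strided, not coalesced
}

__global__ void read_soa(const float *x, float *out, int n)   // SoA: the x field has its own array
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = x[tid];        // neighbouring threads read consecutive floats: coalesced
}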
There are also apps in which successive blocks have to access successive structures. In such cases, the threads inside a block can cooperatively cache the structure from global memory in a coalesced way, like this:
Say you have an array of structures like this:
struct hello
{
 int i, j, k;
 float f[20];
};
And say you have
struct hello arrayhello[200];
Say you spawn 200 blocks and each block has to load this structure into shared memory…
You could do it like this:
__shared__ struct hello smemAOS;              // this block's copy of its structure
int *src = (int *) &globalAOS[blockIdx.x];    // globalAOS: device pointer to the arrayhello data
int *dst = (int *) &smemAOS;
for (int i = threadIdx.x; i < sizeof(struct hello)/sizeof(int); i += blockDim.x)
{
  dst[i] = src[i];                            // adjacent threads copy adjacent ints
}
__syncthreads();                              // wait until the whole structure is in shared memory
So, if your structure is big enough (at least 32*4 bytes) and its size is a multiple of 4, you would get coalescing automatically…
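Wrapped into a full kernel, the pattern might look something like this (a sketch; the kernel name, launch configuration, and the d_arrayhello device pointer are assumptions, not from the original post):

__global__ void per_block_kernel(struct hello *globalAOS)
{
    __shared__ struct hello smemAOS;
    int *src = (int *) &globalAOS[blockIdx.x];
    int *dst = (int *) &smemAOS;
    for (int i = threadIdx.x; i < sizeof(struct hello)/sizeof(int); i += blockDim.x)
        dst[i] = src[i];                      // same staging loop as above
    __syncthreads();
    // every thread in the block can now read smemAOS.i, smemAOS.f[...] from shared memory
}

// host side: copy arrayhello to the device, then launch one block per structure
struct hello *d_arrayhello;
cudaMalloc((void **) &d_arrayhello, 200 * sizeof(struct hello));
cudaMemcpy(d_arrayhello, arrayhello, 200 * sizeof(struct hello), cudaMemcpyHostToDevice);
per_block_kernel<<<200, 32>>>(d_arrayhello);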
Arrays of structures can be efficient as long as the structure is 32, 64, or 128 bytes, because structures of this size can be coalesced. For anything larger, use SoA or you will cry (the performance penalty for non-coalesced accesses is a factor of 10-20).
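For the struct hello from the earlier post, the SoA rewrite could look roughly like this (a sketch; soahello and read_soa_field are made-up names, and the field-major layout of f is one of several reasonable choices):

struct soahello
{
    int   i[200], j[200], k[200];   // one contiguous array per field
    float f[20][200];               // f[c][n] holds element c of struct n
};

__global__ void read_soa_field(const struct soahello *soa, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 200)
        out[tid] = soa->i[tid] + soa->f[0][tid];   // neighbouring threads hit neighbouring addresses: coalesced
}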
You have 3 separate memory writes spanning the contiguous bytes of the struct, so none of them are coalesced.
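Presumably that refers to per-thread, per-field writes into the AoS layout, something along these lines (an illustrative sketch using the struct hello from above; the kernel itself is made up):

__global__ void write_aos_fields(struct hello *globalAOS, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
    {
        globalAOS[tid].i = tid;   // store 1: neighbouring threads write addresses sizeof(struct hello) bytes apart
        globalAOS[tid].j = tid;   // store 2: same stride
        globalAOS[tid].k = tid;   // store 3: same stride, so none of the three stores coalesces
    }
}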
Look, you don’t have to just believe me, OK? Write a microbenchmark and measure the bandwidth. Run it through the profiler (assuming you are on a machine that supports the profiler counters) and see what it has to say about incoherent loads/writes.
Or just look it up in the programming guide; it only takes 10 seconds.
I assumed that the compiler would do some magic to make these reads coalesced. But I guess the magic has to be done by the developer (say, via shared memory), since, as a quick test revealed, the compiler is unable to make these reads coalesced: the bandwidth is just 16 GB/s, compared to a maximum of 80 GB/s from coalesced reads.
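For reference, that kind of quick test can be reproduced with a small microbenchmark along these lines (a sketch, not the original poster's code; the element count, block size, and kernel names are arbitrary, and it reports effective bandwidth of the useful floats only):

#include <cstdio>

struct hello { int i, j, k; float f[20]; };   // the 92-byte AoS record from above

__global__ void read_aos(const struct hello *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid].f[0];     // strided by sizeof(struct hello): uncoalesced
}

__global__ void read_soa(const float *f0, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = f0[tid];          // consecutive floats: coalesced
}

int main()
{
    const int n = 1 << 20;
    struct hello *aos;  float *f0, *out;
    cudaMalloc((void **) &aos, n * sizeof(struct hello));
    cudaMalloc((void **) &f0,  n * sizeof(float));
    cudaMalloc((void **) &out, n * sizeof(float));
    // contents don't matter for a bandwidth test

    read_soa<<<(n + 255) / 256, 256>>>(f0, out, n);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;  float ms;
    cudaEventCreate(&t0);  cudaEventCreate(&t1);

    cudaEventRecord(t0);
    read_aos<<<(n + 255) / 256, 256>>>(aos, out, n);
    cudaEventRecord(t1);  cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("AoS read: %.1f GB/s effective\n", n * sizeof(float) / (ms * 1e6));

    cudaEventRecord(t0);
    read_soa<<<(n + 255) / 256, 256>>>(f0, out, n);
    cudaEventRecord(t1);  cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("SoA read: %.1f GB/s effective\n", n * sizeof(float) / (ms * 1e6));
    return 0;
}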