Loading structured data efficiently using CUDA: can this be right?


There is this document by NVIDIA:

and in the section ‘Custom Structures’ it says that
‘SoA (structure of arrays) is the preferable approach for many cases for data-parallel computations because it groups related data into a contiguous array.’

Can this be true? In SoA, every member of a struct gets its own array, so the members of one struct are not grouped contiguously. Is this a typo? Should it say AoS?
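For what it's worth, the two layouts can be sketched like this (a hypothetical Particle example, not from the NVIDIA document):

```cuda
// AoS: one array of structs. The fields of a single particle are adjacent,
// but the x fields of successive particles are sizeof(ParticleAoS) apart.
struct ParticleAoS { float x, y, z; };
ParticleAoS particlesAoS[1024];

// SoA: one array per field. The x values of ALL particles are contiguous,
// so when thread i reads x[i], a warp touches consecutive addresses.
struct ParticlesSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};
```

Read this way, "related data" in the quote means the same field across many elements, i.e. what a warp of threads touches together, not the fields of one element.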


This PDF talks about applications in which successive threads load successive structures from memory. In such cases, having SoA will result in coalesced accesses…
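As a sketch of that case (the names and the SoA field x are made up for illustration):

```cuda
// Successive threads read successive elements of one SoA field:
// thread i reads x[i], so a warp touches one contiguous, aligned
// segment of memory and the loads coalesce into few transactions.
__global__ void scaleX(const float *x, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * x[i];
}
```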

There are apps in which successive blocks have to access successive structures. In such cases, the threads inside a block can cache global memory in a coalesced way like this:

Say you have an array of structures like this:

struct hello
{
    int i, j, k;
    float f[20];
};

And say you have:

struct hello arrayhello[200];

Say you spawn 200 blocks and each block has to load this structure into shared memory…

You could do it like this:

int *src = (int *) &globalAOS[blockIdx.x];
int *dst = smemAOS;

for (int i = threadIdx.x; i < sizeof(struct hello) / sizeof(int); i += blockDim.x)
{
    dst[i] = src[i];
}


So, if your structure is somewhat big enough (at least 32*4 bytes) and its size is a multiple of 4 bytes, you get coalescing automatically…
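For reference, here is the fragment above fleshed out into a complete (if sketchy) kernel; smemAOS is assumed here to be a shared-memory copy of one struct, and the actual processing is left as a placeholder:

```cuda
struct hello
{
    int i, j, k;
    float f[20];     // sizeof(struct hello) == 92 bytes, a multiple of 4
};

__global__ void processAoS(struct hello *globalAOS)
{
    __shared__ struct hello smemAOS;

    // View both structs as arrays of 32-bit words and let the block's
    // threads copy them cooperatively: thread t copies words
    // t, t + blockDim.x, t + 2*blockDim.x, ...
    int *src = (int *) &globalAOS[blockIdx.x];
    int *dst = (int *) &smemAOS;

    for (int i = threadIdx.x; i < sizeof(struct hello) / sizeof(int); i += blockDim.x)
        dst[i] = src[i];

    __syncthreads();   // the whole block must see the completed copy

    // ... each thread now works on smemAOS ...
}
```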

Arrays of structures can be efficient as long as the structure is 32, 64, or 128 bytes, because structures of this size can be coalesced. For anything larger, use SoA or you will cry (the performance penalty for non-coalesced accesses is a factor of 10-20).

You mean the coalescing conditions for compute capability 1.2, is that right?

Because with compute capability 1.0, each thread's access must be 32, 64, or 128 bits long (that is, int, int2, int4) to be coalesced.

No, I simply meant bits instead of bytes. Sorry for any confusion.

What if my structure is:

struct __align__(8) float6
{
    float2 u, v, w;
};


Inspecting the .ptx reveals that the data is stored via:

st.global.v2.f32 [%r52 + 0], {$f1, $f2}

st.global.v2.f32 [%r52 + 8], {$f3, $f4}

st.global.v2.f32 [%r52 + 16], {$f5, $f6}

Are these writes coalesced?

If so, then I expect that the following structure also guarantees coalesced reads:

struct __align__(16) float8
{
    float4 u, v;
};





You have 3 separate memory writes spanning the contiguous bytes of the struct, so none of them are coalesced: consecutive threads write addresses 24 bytes apart rather than consecutive words.

Look, you don’t have to just believe me, OK? Write a micro-benchmark and measure the bandwidth. Run it through the profiler (assuming you are on a machine that supports the profiler counters) and see what it has to say about incoherent loads/writes.

Or just look it up in the programming guide; it only takes 10 seconds

from section

I assumed that the compiler would do some magic to make these reads coalesced. But I guess the magic has to be done by the developer (say, via shared memory), since, as a quick test revealed, the compiler is unable to make these reads coalesced: the bandwidth is just 16 GB/s, compared to the maximum of 80 GB/s for coalesced reads.
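A quick test of this kind can be timed with CUDA events; the following is a minimal host-side sketch, where copyKernel, grid, block, dst, src, and n are placeholders for whatever access pattern you want to measure:

```cuda
// Time one kernel launch with CUDA events and derive effective bandwidth.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
copyKernel<<<grid, block>>>(dst, src, n);   // placeholder kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// n floats read plus n floats written, divided by elapsed seconds:
double gbPerSec = 2.0 * n * sizeof(float) / (ms / 1000.0) / 1e9;
```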

Is this doc still around??

I’m trying to find some documentation on using structs with CUDA.