Coalesced access to a structure of arrays

I have a structure of arrays (SoA) in global memory:
ttt,…,ttt; aaa,…,aaa; fff,…,fff;
I can sort the threads so that they access the arrays in a coalesced way.

In some cases the cost of sorting grows too large, so I won’t sort the threads. The threads then access the arrays with effectively random ids.
Now I’m wondering whether an array of structures (AoS: taf,taf,…,taf) would win over the structure of arrays in that case, because each thread accesses t, a and f one after another.

With sorted threads, is access to the SoA coalesced because the data fetched for one thread is adjacent to the data the other threads need?
With random access, say thread 0 accesses t12, a12, f12 and thread 1 accesses t39, a39, f39, and so on. When thread 0 accesses t12, will that also fetch a12 and f12? Or does it only fetch t11, t13, etc., so that t11 and t13 are simply wasted?

Thanks!

I feel like the best way to choose between SoA and AoS is to ask yourself: each time I access the structure, do I need all of it?

What I mean is, if you have an AoS and you access each structure individually, are you using every member of the structure? If so, then AoS is probably better.

But if not, then use SoA, because if you’re only accessing one particular member across many elements, you’d probably get better cache locality from reading just one contiguous block.
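To put rough numbers on which neighbours come along with a fetch, here is a minimal sketch of the address arithmetic for both layouts. It assumes three float members (t, a, f), an SoA buffer with the three arrays stored back to back, and a 32-byte fetch granularity; the helper names and the 32-byte figure are assumptions for illustration, and the real granularity depends on the GPU generation and on which cache level serves the load.

```cuda
#include <cstddef>

// Assumed for illustration: three float members and a 32-byte fetch unit.
constexpr std::size_t kSectorBytes = 32;
constexpr int kMembers = 3;   // t, a, f

// Byte offset of member m (0 = t, 1 = a, 2 = f) of element i in an AoS buffer.
__host__ __device__ inline std::size_t aosOffset(std::size_t i, int m)
{
    return (i * kMembers + m) * sizeof(float);
}

// Same member in an SoA buffer that holds n elements per member array,
// with the t, a and f arrays stored back to back.
__host__ __device__ inline std::size_t soaOffset(std::size_t i, int m, std::size_t n)
{
    return (m * n + i) * sizeof(float);
}

// Index of the fetch unit (sector) that contains a given byte offset.
__host__ __device__ inline std::size_t sectorOf(std::size_t byteOffset)
{
    return byteOffset / kSectorBytes;
}
```

Under these assumptions, element 12 in the AoS layout sits at byte offsets 144, 148 and 152, all in sector 4, so one fetch covers t12, a12 and f12. In the SoA layout with n = 1024, the same three values sit at offsets 48, 4144 and 8240, in sectors 1, 129 and 257, so each load brings in neighbouring t (or a, or f) values that a thread with a random id does not use.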

But that’s just my 2 cents leftover from CPU-based programming so it may not apply here.

Hopefully someone more confident than me will chime in.

Thanks for your reply. Your observation makes sense on the CPU.
SoA is usually preferred on the GPU because the memory fetched for one thread also contains the data that neighbouring threads need.
AoS is better on the CPU when I use all members of the structure, because fetching one member brings the others into cache for the later steps.

What I’m confused about is whether the latter case, namely caching for later use, also holds on the GPU, or whether the cache is too small and gets flushed at every step by the many other threads.

Only one way to find out: write both versions and profile them for cache behavior.
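As a starting point for that comparison, here is a rough sketch that times an AoS gather and an SoA gather over the same index array. Everything in it is made up for illustration (the ElemAoS and Timing types, the kernel names, the compareLayouts helper, the block size of 256), and it only measures elapsed time with CUDA events; cache hit rates would still have to come from a profiler.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

struct ElemAoS { float t, a, f; };          // hypothetical element type
struct Timing  { float msAoS, msSoA; };     // elapsed milliseconds per layout

__global__ void gatherAoS(const ElemAoS *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ElemAoS e = in[idx[i]];             // t, a, f are adjacent in memory
        out[i] = e.t + e.a + e.f;
    }
}

__global__ void gatherSoA(const float *t, const float *a, const float *f,
                          const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = idx[i];                     // same id, three separate arrays
        out[i] = t[j] + a[j] + f[j];
    }
}

// Times one launch of each kernel over n elements.
// shuffle = true mimics unsorted threads, false mimics sorted threads.
// Error checking is omitted to keep the sketch short.
Timing compareLayouts(int n, bool shuffle)
{
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::mt19937 rng(42);
    if (shuffle) std::shuffle(idx.begin(), idx.end(), rng);
    std::vector<float> data(3 * static_cast<size_t>(n), 1.0f);

    const size_t bytes = data.size() * sizeof(float);
    ElemAoS *dAoS; float *dSoA, *dOut; int *dIdx;
    cudaMalloc(&dAoS, bytes);
    cudaMalloc(&dSoA, bytes);
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dAoS, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dSoA, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    Timing r;

    gatherAoS<<<blocks, threads>>>(dAoS, dIdx, dOut, n);      // warm-up launch
    cudaEventRecord(start);
    gatherAoS<<<blocks, threads>>>(dAoS, dIdx, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&r.msAoS, start, stop);

    // The SoA buffer holds the t, a and f arrays back to back.
    const float *t = dSoA, *a = dSoA + n, *f = dSoA + 2 * static_cast<size_t>(n);
    gatherSoA<<<blocks, threads>>>(t, a, f, dIdx, dOut, n);   // warm-up launch
    cudaEventRecord(start);
    gatherSoA<<<blocks, threads>>>(t, a, f, dIdx, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&r.msSoA, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dAoS); cudaFree(dSoA); cudaFree(dOut); cudaFree(dIdx);
    return r;
}
```

Calling compareLayouts(1 << 20, true) and compareLayouts(1 << 20, false) gives a first impression of the unsorted and sorted cases. A serious measurement would repeat each launch many times and use a workload closer to the real kernel.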

Edit: You can also configure the on-chip memory so that the L1 cache gets 48 KB and shared memory only gets 16 KB.
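For reference, a minimal sketch of requesting that split through the runtime API. The kernel and the preferL1 helper are placeholders for illustration. cudaFuncSetCacheConfig with cudaFuncCachePreferL1 is a preference rather than a guarantee, and on devices where the L1 and shared memory sizes are fixed it has no effect.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each thread reads one element through an index array.
__global__ void gatherKernel(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Ask the runtime to favour L1 cache over shared memory for this kernel.
// Call once before launching; returns the CUDA status code.
cudaError_t preferL1()
{
    return cudaFuncSetCacheConfig(gatherKernel, cudaFuncCachePreferL1);
}
```

This only makes sense if the kernel does not itself need more shared memory than the smaller partition provides. cudaDeviceSetCacheConfig sets the same preference for the whole device instead of a single kernel.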