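I quote from the CUDA Programming Guide V1.1: “coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses”.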
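I’d like to load float4 values from global memory. I can organize the data either as an SoA (structure of arrays, one float array per component) or as an AoS (array of structures, a single float4 array).
Considering the quotation above, I wonder which is more efficient: using the SoA with four 32-bit loads per thread, or using the AoS with one 128-bit load per thread?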
I always need all elements together so I wouldn’t fetch redundant bits with a 128-bit load.
To me it sounds like the 32-bit loads should be more efficient, since “coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses”. However, I am not certain. What is your experience?
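
For concreteness, here is a minimal sketch of the two variants I am comparing. The kernel names (`sum_aos`, `sum_soa`) and the helper `compare_layouts` are made up for illustration, the per-thread work (summing the four components) is just a placeholder, and error checking is left out. The timing is a rough single-launch measurement, not a proper benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// AoS: each thread issues one 128-bit load (float4 is 16-byte aligned).
__global__ void sum_aos(const float4 *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// SoA: each thread issues four 32-bit loads, one per component array.
__global__ void sum_soa(const float *x, const float *y, const float *z,
                        const float *w, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = x[i] + y[i] + z[i] + w[i];
}

// Hypothetical helper: runs both kernels on the same data, prints rough
// timings, and returns true if both layouts produce identical results.
// Assumes n is a multiple of 16 so each SoA component array stays aligned.
bool compare_layouts(int n)
{
    std::vector<float4> h_aos(n);
    std::vector<float> h_soa(4 * n);  // layout: x[0..n) y[0..n) z[0..n) w[0..n)
    for (int i = 0; i < n; ++i) {
        float4 v = make_float4(float(i % 7), float(i % 5), float(i % 3), 1.0f);
        h_aos[i] = v;
        h_soa[i] = v.x;
        h_soa[n + i] = v.y;
        h_soa[2 * n + i] = v.z;
        h_soa[3 * n + i] = v.w;
    }

    float4 *d_aos;
    float *d_soa, *d_out_aos, *d_out_soa;
    cudaMalloc((void **)&d_aos, n * sizeof(float4));
    cudaMalloc((void **)&d_soa, 4 * n * sizeof(float));
    cudaMalloc((void **)&d_out_aos, n * sizeof(float));
    cudaMalloc((void **)&d_out_soa, n * sizeof(float));
    cudaMemcpy(d_aos, h_aos.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemcpy(d_soa, h_soa.data(), 4 * n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms_aos = 0.0f, ms_soa = 0.0f;

    sum_aos<<<blocks, threads>>>(d_aos, d_out_aos, n);  // warm-up launch
    cudaEventRecord(t0);
    sum_aos<<<blocks, threads>>>(d_aos, d_out_aos, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_aos, t0, t1);

    sum_soa<<<blocks, threads>>>(d_soa, d_soa + n, d_soa + 2 * n, d_soa + 3 * n,
                                 d_out_soa, n);         // warm-up launch
    cudaEventRecord(t0);
    sum_soa<<<blocks, threads>>>(d_soa, d_soa + n, d_soa + 2 * n, d_soa + 3 * n,
                                 d_out_soa, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_soa, t0, t1);

    std::vector<float> r_aos(n), r_soa(n);
    cudaMemcpy(r_aos.data(), d_out_aos, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(r_soa.data(), d_out_soa, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool same = true;
    for (int i = 0; i < n; ++i)
        if (r_aos[i] != r_soa[i]) same = false;

    printf("AoS (1 x 128-bit load): %.3f ms\n", ms_aos);
    printf("SoA (4 x 32-bit loads): %.3f ms\n", ms_soa);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d_aos);
    cudaFree(d_soa);
    cudaFree(d_out_aos);
    cudaFree(d_out_soa);
    return same;
}

int main()
{
    return compare_layouts(1 << 22) ? 0 : 1;
}
```

In both versions consecutive threads read consecutive elements, so the loads should be coalesced; the only difference is the width of each access. Which one comes out faster will presumably depend on the hardware, so the timings are only meaningful on the card they were measured on.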