Efficient global memory access: 32-, 64- or 128-bit loads?


I quote from CUDA Programming Guide V1.1:

I’d like to load float4 types from global memory. I can organize the data either as an SoA (structure of arrays) or as an AoS (array of structures).

Considering the quotation above, I wonder which is more efficient: using the SoA with four 32-bit loads, or using the AoS with one 128-bit load?

I always need all elements together so I wouldn’t fetch redundant bits with a 128-bit load.

To me it sounds like using 32-bit loads is more efficient, since “coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses”. However, I am not certain. What is your experience?

Thanks, quak

MisterAnderson42 recently mentioned that fetching float4 values from a 1D texture will allow you to reach the full memory bandwidth:


Thank you for the link.

However, I do not really know how to interpret the results.

What I am interested in is the read from global memory, as textures are not an option.

These results obviously suggest 32-bit loads. However, I do not understand why the 128-bit loads need that much more time while the bandwidth is not that bad (in relation to the time difference).

Another thing: I have seen a presentation chart suggesting that one load has to finish before a second one can be issued, i.e. if I do two consecutive loads, it takes around 800 cycles until both data words have arrived, not 401 as I would have expected. Is that true?

My experience has been that you can achieve highest bandwidth with 64-bit accesses (float2, for example).

The presentation chart you saw is incorrect. Independent loads will get pipelined. Only an instruction that needs an argument from a load will block until the load completes (instructions are issued in order). Can you tell me where to find that presentation?
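To make the 401-versus-800 question concrete, here is a toy host-side C++ model of in-order issue. It is only a sketch under stated assumptions, not a description of the actual hardware: one instruction issues per cycle, a load takes a fixed 400 cycles (a figure implied by the numbers in the question, not a measured value), and an instruction waits only for the one earlier result it depends on. The names `Instr` and `finish_cycle` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction record for the toy model.
struct Instr {
    bool is_load;  // true: global load, false: one-cycle arithmetic op
    int  dep;      // index of an earlier instruction whose result is needed, or -1
};

// Returns the cycle at which the last result becomes available.
// Assumptions: in-order issue, one issue slot per cycle, fixed load latency.
int finish_cycle(const std::vector<Instr>& prog, int load_latency) {
    std::vector<int> ready(prog.size(), 0);  // cycle at which each result is available
    int issue = 0;                           // earliest cycle the next instruction may issue
    int last = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        const int dep = prog[i].dep;
        assert(dep < static_cast<int>(i));   // may only depend on earlier instructions
        if (dep >= 0) issue = std::max(issue, ready[dep]);  // block until the operand arrives
        ready[i] = issue + (prog[i].is_load ? load_latency : 1);
        last = std::max(last, ready[i]);
        issue += 1;                          // the next instruction issues no earlier than one cycle later
    }
    return last;
}
```

In this model, two independent loads `{{true, -1}, {true, -1}}` with a latency of 400 finish at cycle 401, because the second load is issued while the first is still in flight. Only when the second load depends on the first, `{{true, -1}, {true, 0}}`, does the total become 800.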


I suspect that is because each test reads the same number of array elements, not the same number of bytes. The float4 benchmark will read 128 bits per array element, so it will transfer 4x more data than the float benchmark.
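The arithmetic behind this can be sketched in a few lines of host-side C++. This is not the benchmark code from the linked post; `effective_bandwidth` and `relative_bandwidth` are hypothetical helpers, and any timings you feed them are your own measurements.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in bytes per second for a run that reads `elements`
// items of `bytes_per_element` bytes each in `seconds`.
double effective_bandwidth(std::size_t elements, std::size_t bytes_per_element,
                           double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(elements) *
           static_cast<double>(bytes_per_element) / seconds;
}

// Bandwidth of run B relative to run A when both read the same number of
// elements: (bytes_b / bytes_a) / (seconds_b / seconds_a).
double relative_bandwidth(std::size_t bytes_a, double seconds_a,
                          std::size_t bytes_b, double seconds_b) {
    assert(bytes_a > 0 && seconds_a > 0.0 && seconds_b > 0.0);
    const double byte_ratio = static_cast<double>(bytes_b) / static_cast<double>(bytes_a);
    const double time_ratio = seconds_b / seconds_a;
    return byte_ratio / time_ratio;
}
```

With made-up numbers purely for illustration: a float4 run (16 bytes per element) that takes 5 times as long as a float run (4 bytes per element) over the same element count still reaches 0.8 of the float bandwidth, since it moves 4 times the data. That is why a large time difference can go together with a bandwidth that does not look that bad.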

Have a look at page 81 of http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf. It is probably a misunderstanding on my part, but I would appreciate a clarification.

If that is true, 64-bit loads appear to be the best choice, as paulius suggested.

Why are textures not an option? Just use a 1D texture bound to global memory, like I did in that test. If they still aren’t an option, you can uninterleave your memory. Instead of reading float4 element i, read float elements i, i + pitch, i + pitch*2, and i + pitch*3, with pitch chosen to be a multiple of 32 and the components of what was a float4 array spread out into the corresponding locations. Four 32-bit loads will give much better throughput than one 128-bit load. A slightly simpler method would be to uninterleave into only 2 arrays and perform 64-bit loads, which paulius notes above as being faster.
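As a rough sketch of the uninterleaved layout, here is a host-side C++ version of the index arithmetic. It only shows where the components end up; in a kernel, each thread would perform the four reads in `gather` with its own index i. `Float4` is a stand-in for CUDA's float4, and `padded_pitch`, `uninterleave` and `gather` are hypothetical names, not part of any API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };  // stand-in for CUDA's float4

// Round n up to the next multiple of 32 so every component plane starts
// on a multiple-of-32 element boundary.
std::size_t padded_pitch(std::size_t n) { return (n + 31) / 32 * 32; }

// AoS -> four planes in one buffer: x at [0, pitch), y at [pitch, 2*pitch),
// z at [2*pitch, 3*pitch), w at [3*pitch, 4*pitch).
std::vector<float> uninterleave(const std::vector<Float4>& aos, std::size_t pitch) {
    assert(pitch % 32 == 0 && pitch >= aos.size());
    std::vector<float> soa(4 * pitch, 0.0f);  // padding slots stay zero
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa[i]             = aos[i].x;
        soa[i + pitch]     = aos[i].y;
        soa[i + 2 * pitch] = aos[i].z;
        soa[i + 3 * pitch] = aos[i].w;
    }
    return soa;
}

// What a thread with index i would do: four separate 32-bit reads.
Float4 gather(const std::vector<float>& soa, std::size_t pitch, std::size_t i) {
    assert(i < pitch && soa.size() >= 4 * pitch);
    return Float4{soa[i], soa[i + pitch], soa[i + 2 * pitch], soa[i + 3 * pitch]};
}
```

For an array of 100 elements, `padded_pitch` gives 128, so the buffer holds 512 floats. The two-array variant for 64-bit loads would follow the same pattern with (x, y) pairs in one plane and (z, w) pairs in the other.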

The “float” test used 32-bit loads; the “float4” test used 128-bit loads. The number of array elements was constant, so the float4 test transferred more total data. You can look at the code (and run it) yourself if you scroll farther down in that post.

Textures are not an option because I need the texture cache already for different data.

That’s the way to go, thanks a lot.

The presentation is fine. The slide says that a load instruction blocks dependent instructions (those are the instructions using the load results as an argument).

I see how the figure at the bottom could create some confusion, as it does not explicitly visualize the dependent instructions, which are implied. That’d be my fault, not Mark’s, as I made the figure a while ago.


Thank you for the clarification, I didn’t relate the “dependent” key word to this figure.