Efficient global memory access: 32-, 64- or 128-bit loads?

quak · January 3, 2008, 5:34pm

hello,

I quote from CUDA Programming Guide V1.1:

I’d like to load float4 types from global memory. I can organize this data either as an SoA (structure of arrays) or as an AoS (array of structures).

Considering the quotation above, I wonder which is more efficient: using the SoA with four 32-bit loads, or using the AoS with one 128-bit load?

I always need all elements together so I wouldn’t fetch redundant bits with a 128-bit load.

To me it sounds like using 32-bit loads is more efficient, since “coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses”. However, I am not certain. What is your experience?

Thanks, quak

seibert · January 3, 2008, 7:13pm

MisterAnderson42 recently mentioned that fetching float4 values from a 1D texture will allow you to reach the full memory bandwidth:

http://forums.nvidia.com/index.php?showtop...41&#entry290441

quak · January 3, 2008, 7:35pm

Thank you for the link.

However I do not really know how to interpret the results.

What I am interested in is the read from global memory as textures are not an option.

These results obviously suggest 32-bit loads. However, while I do understand why the 128-bit loads need that much more time, the bandwidth is not that bad (in relation to the time difference).

Another thing: I have seen a presentation chart suggesting that one load has to finish before a second one can be issued, i.e. if I do two consecutive loads it takes around 800 cycles until both data words have arrived, not 401 as I would have expected. Is that true?

paulius · January 3, 2008, 8:12pm

My experience has been that you can achieve highest bandwidth with 64-bit accesses (float2, for example).

The presentation chart you saw is incorrect. Independent loads will get pipelined. Only an instruction that needs an argument from a load will block until the load completes (instructions are issued in order). Can you tell me where to find that presentation?

Paulius

seibert · January 4, 2008, 1:35am

I suspect that is because each test reads the same number of array elements, not the same number of bytes. The float4 benchmark will read 128 bits per array element, so it will transfer 4x more data than the float benchmark.

quak · January 4, 2008, 12:36pm

Have a look at http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf on page 81. It is probably a misunderstanding but I would like you to clarify, please.

If that is true, 64-bit loads appear to be the best choice like paulius suggested.

MisterAnderson42 · January 4, 2008, 8:57pm

Why are textures not an option? Just use a 1D texture bound to global memory, like I did in that test. If they still aren’t an option, you can uninterleave your memory. Instead of reading float4 element i, read float elements i, i + pitch, i + pitch*2, and i + pitch*3, with pitch chosen such that it is a multiple of 32 and the components of what was a float4 array are spread out into the proper locations. Four 32-bit loads will give much better throughput than one 128-bit load. A slightly simpler method would be to uninterleave into only 2 arrays and perform 64-bit loads, which paulius notes above as being faster.
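The uninterleaving step could be sketched on the host roughly as follows. This is only an illustration in plain C++, not code from the benchmark: `Float4`, `padded_pitch`, `uninterleave` and `load_element` are hypothetical names, and `Float4` stands in for CUDA's `float4` so the sketch compiles without the CUDA headers. It assumes the pitch is the element count rounded up to a multiple of 32.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's float4 so the sketch compiles as plain C++.
struct Float4 { float x, y, z, w; };

// Round n up to the next multiple of 32 elements (assumed choice of pitch).
std::size_t padded_pitch(std::size_t n) {
    return (n + 31) / 32 * 32;
}

// Split an interleaved float4 array into four planes stored back to back:
// out[i] = x, out[i + pitch] = y, out[i + 2*pitch] = z, out[i + 3*pitch] = w.
std::vector<float> uninterleave(const std::vector<Float4>& in, std::size_t pitch) {
    assert(pitch >= in.size());
    std::vector<float> out(4 * pitch, 0.0f);  // padding slots stay zero
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i]             = in[i].x;
        out[i + pitch]     = in[i].y;
        out[i + 2 * pitch] = in[i].z;
        out[i + 3 * pitch] = in[i].w;
    }
    return out;
}

// Reassemble element i with four separate 32-bit reads, mirroring the
// index pattern each thread would use in a kernel.
Float4 load_element(const std::vector<float>& planes, std::size_t pitch, std::size_t i) {
    return Float4{planes[i], planes[i + pitch], planes[i + 2 * pitch], planes[i + 3 * pitch]};
}
```

In a kernel, thread i would issue the same four reads as `load_element`, so that neighbouring threads touch neighbouring 32-bit words within each plane.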

The “float” test used 32-bit loads; the “float4” test used 128-bit loads. The number of array elements was constant, so the float4 test transferred more total data. You can look at the code (and run it) yourself if you scroll farther down in that post.
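To make the arithmetic concrete, here is a small sketch of how run time and effective bandwidth relate when the element count is fixed. The helper names are hypothetical and this is not the benchmark code from the linked post; it only assumes that every element is read exactly once and that 1 GB means 1e9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved when each of n_elements is read exactly once.
double bytes_transferred(std::size_t n_elements, std::size_t bytes_per_element) {
    return static_cast<double>(n_elements) * static_cast<double>(bytes_per_element);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes), given elapsed time in milliseconds.
double bandwidth_gb_per_s(double bytes, double elapsed_ms) {
    return bytes / (elapsed_ms * 1.0e-3) / 1.0e9;
}
```

For the same number of elements, a float4 run moves 4x the bytes of a float run (16 bytes per element instead of 4). With made-up timings, a float4 run that took twice as long as the float run would still have moved data at twice the rate, which is why the times and the bandwidth figures have to be read together.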

quak · January 5, 2008, 11:12am

Textures are not an option because I need the texture cache already for different data.

That’s the way to go, thanks a lot.

paulius · January 7, 2008, 2:07am

The presentation is fine. The slide says that a load instruction blocks dependent instructions (those are the instructions using the load results as an argument).

I see how the figure at the bottom could create some confusion, as it does not explicitly visualize the dependent instructions, which are implied. That’d be my fault, not Mark’s, as I made the figure a while ago.

Paulius

quak · January 7, 2008, 7:42pm

Thank you for the clarification, I didn’t relate the “dependent” key word to this figure.

quak
