Recommended coalesced access word size

caesaretos · February 15, 2017, 10:03am

It can’t be stressed enough that memory accesses among threads should be coalesced. My question is: what word size should a single thread read from or write to memory? To make this clearer, let’s look at the following simple illustration of memory:

| word_0 | word_1 | word_2 | … | word_k |

From what I understand, thread 0 is supposed to access word_0, thread 1 accesses word_1, and so on.
If my data are 32-bit integers, then the words are 4 bytes each.
Now suppose my data are unsigned long int (i.e. 64-bit integers). Is accessing 8 bytes at a time per thread still considered coalesced access?

Another related question: what is the optimal size of word_i for a thread to access?
Does this optimal word size differ from one GPU to another?

Robert_Crovella · February 15, 2017, 2:06pm

Striving for coalesced access involves attempting to arrange the addresses requested by each thread in a warp such that the minimum number of memory (or cache) transactions will be required to satisfy the request. Very often this may be one 128-byte transaction (32 threads × 4 bytes), but when requesting 8 or more bytes per thread, a full warp asks for at least 256 bytes, so the result will necessarily be 2 or more transactions per request. As long as your addresses result in adjacent, contiguous access, you have generally met the goal.
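To make the transaction counting concrete, here is a rough host-side sketch in plain C++ (no GPU needed). It assumes a 32-thread warp and 128-byte transactions, as in the reply above; the names `segments_touched`, `kWarpSize`, and `kSegmentBytes` are made up for illustration, and real hardware (for example, caches that work in smaller sectors) may count differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed values for illustration only.
constexpr std::size_t kWarpSize = 32;       // threads per warp
constexpr std::size_t kSegmentBytes = 128;  // bytes per memory transaction

// Count the distinct 128-byte segments touched by one warp when thread t
// accesses word_bytes bytes starting at base + t * stride_bytes.
// Fewer segments means a better-coalesced request.
std::size_t segments_touched(std::uint64_t base,
                             std::size_t word_bytes,
                             std::size_t stride_bytes) {
    std::set<std::uint64_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::uint64_t first = base + t * stride_bytes;
        const std::uint64_t last = first + word_bytes - 1;
        // A single thread's word may itself straddle a segment boundary.
        for (std::uint64_t s = first / kSegmentBytes; s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, `segments_touched(0, 4, 4)` is 1 (adjacent 32-bit words), `segments_touched(0, 8, 8)` is 2 (adjacent 64-bit words, still the minimum possible for 256 bytes), and `segments_touched(0, 4, 128)` is 32 (a large stride, where every thread lands in its own segment).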

If you’d like to learn more about the details, I suggest you study slides 30-48 (at least) of this presentation:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

From an efficiency standpoint, it is often the case that increasing the amount of data requested by each thread, up to the maximum of 16 bytes per thread per request, may result in even more efficient use of the memory subsystem. However, the benefits and applicability of this vary based on the code you are actually writing, and in many cases it may be nearly as fast just to request 4 or 8 bytes per thread.
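One plausible way to see why wider per-thread accesses can help is to count how many warp-level load requests are needed to move the same data. The sketch below is a host-side C++ model, not GPU code; it assumes contiguous, aligned data, a 32-thread warp, and 128-byte transactions, and the names `StreamCost` and `stream_cost` are hypothetical.

```cpp
#include <cstddef>

// Assumed values for illustration only.
constexpr std::size_t kWarpSize = 32;       // threads per warp
constexpr std::size_t kSegmentBytes = 128;  // bytes per memory transaction

struct StreamCost {
    std::size_t warp_requests;  // load requests one warp must issue
    std::size_t transactions;   // 128-byte transactions moved in total
};

// Cost for a single warp to read total_bytes of contiguous, aligned data
// when each thread loads bytes_per_thread (4, 8, or 16) per request.
// Assumes total_bytes is a multiple of kWarpSize * bytes_per_thread.
constexpr StreamCost stream_cost(std::size_t total_bytes,
                                 std::size_t bytes_per_thread) {
    return {total_bytes / (kWarpSize * bytes_per_thread),
            total_bytes / kSegmentBytes};
}
```

For a 4096-byte buffer, this model gives 32 requests at 4 bytes per thread, 16 at 8 bytes, and 8 at 16 bytes, while the transaction count stays at 32 in every case. The data moved is identical; only the number of requests issued changes, which is consistent with the caveat above that the gain depends on the code and may be small.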
