Recommended coalesced access word size

It can’t be stressed enough that memory accesses among threads should be coalesced. My question is: what word size should a single thread read from or write to memory? To make this clearer, consider the following simple illustration of memory:


| word_0 | word_1 | word_2 | … | word_k |

From what I understand, thread 0 is supposed to access word_0, thread 1 accesses word_1, and so on.
If my data are 32-bit integers, then we are talking about 4-byte words.
Now suppose my data are unsigned long int (i.e. 64-bit integers). Is accessing 8 bytes at a time per thread still considered coalesced access?

A related question: what is the optimal size of word_i that a thread should access?
Does this optimal word size differ from one GPU to another?

Striving for coalesced access means arranging the addresses requested by the threads in a warp so that the minimum number of memory (or cache) transactions is required to satisfy the request. Very often this is a single 128-byte transaction (32 threads x 4 bytes each). When each thread requests 8 or more bytes, a warp asks for at least 256 bytes, so the result will necessarily be 2 or more transactions per request. That is still coalesced: as long as your addresses result in adjacent, contiguous access, you have generally met the goal.
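To make the transaction counting concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It is only a model under stated assumptions: 32 threads per warp and aligned 128-byte segments, as in the numbers above. The function name `segments_touched` and its parameters are made up for illustration, and real hardware may use a different granularity depending on the architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumptions of this model (not a hardware specification):
// a warp has 32 threads and memory is served in aligned 128-byte segments.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Counts the distinct 128-byte segments touched when thread t accesses
// `word_bytes` bytes starting at address base + t * stride_bytes.
std::size_t segments_touched(std::uint64_t base,
                             std::size_t word_bytes,
                             std::size_t stride_bytes) {
    assert(word_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::uint64_t first = base + t * stride_bytes;
        const std::uint64_t last = first + word_bytes - 1;
        // A single access may straddle a segment boundary, so record
        // every segment between its first and last byte.
        for (std::uint64_t s = first / kSegmentBytes;
             s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, `segments_touched(0, 4, 4)` gives 1, `segments_touched(0, 8, 8)` gives 2, and `segments_touched(0, 16, 16)` gives 4: every case is contiguous and touches the minimum number of segments for the bytes requested. By contrast, a misaligned start such as `segments_touched(4, 4, 4)` gives 2, and a 128-byte stride with 4-byte words, `segments_touched(0, 4, 128)`, gives 32.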

If you’d like to learn more about the details, I suggest you study slides 30-48 (at least) of this presentation:

[url]http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf[/url]

From an efficiency standpoint, increasing the amount of data requested by each thread, up to the maximum of 16 bytes per thread per transaction, often results in even more efficient use of the memory subsystem. However, the benefit and applicability vary with the code you are actually writing, and in many cases requesting just 4 or 8 bytes per thread may be nearly as fast.
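One way to see where the potential gain comes from is to count warp-wide load requests for a fixed amount of data. The sketch below is again plain C++ arithmetic, not a benchmark, and assumes 32 threads per warp; `warp_requests` is a hypothetical helper name. In CUDA code, the 16-byte case would typically correspond to a vector type such as `int4` or `float4`.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kThreadsPerWarp = 32;  // assumed warp width

// Number of warp-wide load requests needed to read `total_bytes` when
// every thread reads `bytes_per_thread` contiguous bytes per request.
// The result is rounded up so a partial final request is counted.
std::size_t warp_requests(std::size_t total_bytes,
                          std::size_t bytes_per_thread) {
    assert(bytes_per_thread == 4 || bytes_per_thread == 8 ||
           bytes_per_thread == 16);
    const std::size_t bytes_per_request = kThreadsPerWarp * bytes_per_thread;
    return (total_bytes + bytes_per_request - 1) / bytes_per_request;
}
```

For a 1 MiB buffer this gives 8192 requests at 4 bytes per thread, 4096 at 8 bytes, and 2048 at 16 bytes. The total bytes moved are identical in all three cases; only the number of requests changes, which is why the measured difference depends on the surrounding code.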