It cannot be stressed enough that memory accesses across threads should be coalesced. My question is: what word size should a single thread read from or write to memory? To make this clearer, consider the following simple illustration of memory:
| word_0 | word_1 | word_2 | … | word_k |
From what I understand, thread 0 is supposed to access word_0, thread 1 accesses word_1, and so on.
If my data are 32-bit integers, then we are talking about 4-byte words.
Now suppose my data are unsigned long int (i.e. 64-bit integers). Is it still considered coalesced access if each thread accesses 8 bytes at a time?
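To make the arithmetic behind this concrete, here is a rough host-side sketch in plain C++ (no GPU needed). It only models addresses and is not how any particular GPU is documented to behave. The warp size of 32 threads, the 32-byte transaction granularity, and the helper names (`bytes_per_warp`, `sectors_touched`, `efficiency`) are my own assumptions for illustration.

```cpp
#include <cstddef>

// Assumed hardware parameters (hypothetical values for illustration only).
constexpr std::size_t kWarpSize = 32;     // threads issuing one memory request together
constexpr std::size_t kSectorBytes = 32;  // assumed granularity of a memory transaction

// Bytes requested by one warp when every thread accesses one word of word_bytes.
constexpr std::size_t bytes_per_warp(std::size_t word_bytes) {
    return kWarpSize * word_bytes;
}

// Number of distinct sectors touched when thread t accesses the word at index
// t * stride_words, with the array starting at a sector-aligned address.
// Addresses grow with t, so remembering the last counted sector is enough
// to avoid counting a sector twice.
constexpr std::size_t sectors_touched(std::size_t word_bytes, std::size_t stride_words) {
    std::size_t count = 0;
    std::size_t last_counted = static_cast<std::size_t>(-1);
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t begin = t * stride_words * word_bytes;
        const std::size_t first = begin / kSectorBytes;
        const std::size_t last = (begin + word_bytes - 1) / kSectorBytes;
        for (std::size_t s = first; s <= last; ++s) {
            if (s != last_counted) {
                ++count;
                last_counted = s;
            }
        }
    }
    return count;
}

// Fraction of the fetched bytes that the warp actually asked for.
constexpr double efficiency(std::size_t word_bytes, std::size_t stride_words) {
    return static_cast<double>(bytes_per_warp(word_bytes)) /
           static_cast<double>(sectors_touched(word_bytes, stride_words) * kSectorBytes);
}

// Consecutive 4-byte words: 128 bytes requested, 4 sectors, nothing wasted.
static_assert(sectors_touched(4, 1) == 4, "32 x 4 bytes fit in 4 sectors");
// Consecutive 8-byte words: 256 bytes requested, 8 sectors, nothing wasted.
static_assert(sectors_touched(8, 1) == 8, "32 x 8 bytes fit in 8 sectors");
// 4-byte words with a stride of 2: still 8 sectors, but only half of them is used.
static_assert(sectors_touched(4, 2) == 8, "strided access spreads over 8 sectors");
```

Under these assumptions, `efficiency(8, 1)` evaluates to 1.0 while `efficiency(4, 2)` evaluates to 0.5. In this model, consecutive 8-byte accesses are just as contiguous as consecutive 4-byte ones and simply move twice as many bytes per warp. Whether real hardware treats them the same way is exactly what I am asking.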
A related question: what is the optimal size of word_i for a single thread to access?
Does this optimal word size differ from one GPU to another?