Memory coalescing

Hi

I’d just like to confirm my understanding of memory coalescing. Within a half-warp, thread N should access the address BaseAddress + N * sizeof(type), so that consecutive threads touch consecutive elements. BaseAddress should be aligned to 16 * sizeof(type) bytes, which is 64 bytes for 4-byte elements such as float. In C you can test for this with (address & 0x3f) == 0; the parentheses matter, because == binds more tightly than &. An example of non-coalesced writes is the naive kernel in the CUDA SDK transpose project. Printing its in and out indices shows the access pattern:

index_out = 0 index_in = 0
index_out = 16 index_in = 1
index_out = 32 index_in = 2
index_out = 48 index_in = 3
index_out = 64 index_in = 4
index_out = 80 index_in = 5
index_out = 96 index_in = 6
index_out = 112 index_in = 7
index_out = 128 index_in = 8
etc…
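
For reference, here is a minimal host-side C sketch that reproduces the listing above. It assumes the naive kernel computes index_in = x + width * y and index_out = y + height * x, and that the matrix is 16 x 16 (inferred from the stride of 16 in the output, not confirmed). The function names are made up for illustration and are not taken from the SDK.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed read index of thread (x, y) in a width-wide input matrix. */
unsigned int naive_index_in(unsigned int x, unsigned int y, unsigned int width)
{
    return x + width * y;
}

/* Assumed write index of thread (x, y) in the transposed output,
   whose rows are `height` elements long. */
unsigned int naive_index_out(unsigned int x, unsigned int y, unsigned int height)
{
    return y + height * x;
}

/* Print the indices for the first `count` threads of row y = 0,
   in the same format as the listing in the post. */
void print_row_indices(unsigned int count, unsigned int width, unsigned int height)
{
    assert(count <= width); /* stay inside one row of the input */
    for (unsigned int x = 0; x < count; ++x) {
        printf("index_out = %u index_in = %u\n",
               naive_index_out(x, 0, height),
               naive_index_in(x, 0, width));
    }
}
```

Calling print_row_indices(9, 16, 16) prints the nine lines shown above: index_in advances by 1 per thread while index_out advances by 16.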

The writes are clearly non-coalesced: consecutive threads write to locations 16 elements apart. The reads are coalesced because consecutive threads read consecutive locations, provided the base address is suitably aligned.
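
To make the check concrete, here is a rough host-side C sketch that tests whether the 16 addresses of a half-warp meet the strict coalescing rules. It assumes the compute capability 1.x requirements as I understand them: thread n accesses base + n * sizeof(type), and base is aligned to 16 * sizeof(type) bytes. The helper names (half_warp_is_coalesced, fill_addresses) and the example base address are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16

/* Returns 1 if the byte addresses follow addr[0] + n * elem_size and
   addr[0] is aligned to HALF_WARP * elem_size bytes, otherwise 0. */
int half_warp_is_coalesced(const uintptr_t addr[HALF_WARP], size_t elem_size)
{
    assert(elem_size > 0);
    if (addr[0] % (HALF_WARP * elem_size) != 0)
        return 0; /* base address is misaligned */
    for (size_t n = 1; n < HALF_WARP; ++n) {
        if (addr[n] != addr[0] + n * elem_size)
            return 0; /* thread n does not access the n-th element */
    }
    return 1;
}

/* Fill addr with the byte addresses touched by a half-warp in which
   thread n accesses element n * stride, starting from `base`. */
void fill_addresses(uintptr_t addr[HALF_WARP], uintptr_t base,
                    size_t stride, size_t elem_size)
{
    for (size_t n = 0; n < HALF_WARP; ++n)
        addr[n] = base + n * stride * elem_size;
}
```

With 4-byte elements and a base of 0x1000, a stride of 1 (the read pattern) passes the check, while a stride of 16 (the write pattern) fails it. A stride of 1 starting at 0x1004 also fails, because the base is no longer aligned to 64 bytes.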

Does this make sense?