An example of coalesced memory access


I have a kernel that reverses the elements of an array. I am using multiple blocks.

The following shows what each thread does:

Reading from global memory

Thread ID    Value accessed from global memory
Thread 0     a[0]
Thread 1     a[1]
Thread 2     a[2]
Thread 3     a[3]
Thread 4     a[4]
Thread 5     a[5]
Thread 6     a[6]
...          ...
Thread 15    a[15]

Writing to global memory

Thread ID    Value written to global memory
Thread 0     a[15]
Thread 1     a[14]
Thread 2     a[13]
...          ...
Thread 12    a[3]
Thread 13    a[2]
Thread 14    a[1]
Thread 15    a[0]

Can anybody confirm whether the memory writes are non-coalesced? When I profile my kernel it shows many incoherent stores, even though the writes to global memory still go to consecutive memory locations.
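For reference, the access pattern in the tables above presumably corresponds to a kernel along the following lines. This is only a sketch: the names `reverse_naive` and `reverse_naive_on_device`, the `int` element type, and the block size of 256 are assumptions for illustration, not the original code, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Each thread reads in[i] (ascending addresses across the half-warp)
// and writes out[n - 1 - i] (descending addresses across the half-warp).
__global__ void reverse_naive(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[n - 1 - i] = in[i];
    }
}

// Hypothetical host-side wrapper: copies the input to the device,
// launches the kernel over multiple blocks, and copies the result back.
void reverse_naive_on_device(const int* h_in, int* h_out, int n)
{
    if (n <= 0) return;
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    reverse_naive<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```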


Consecutively increasing memory locations are a condition for coalesced access… no? (At least on compute 1.0 cards…)

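On compute capability 1.0 and 1.1 devices, your write will be completely uncoalesced, resulting in 16 separate memory transactions per half-warp: those devices coalesce only when thread k of a half-warp accesses word k of an aligned segment, so consecutive addresses in descending order do not qualify. Compute capability 1.2 and up can handle this write pattern as one transaction.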

The solution is to use shared memory as a staging area: write the elements into shared memory in reverse order, then write the shared memory out to global memory in normal order.
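A minimal sketch of that staging approach might look like the following. The names `reverse_staged`, `reverse_staged_on_device`, and `TILE`, the `int` element type, and the tile size of 256 are assumptions for illustration. Each block reads its tile in ascending order, stores it reversed in shared memory, and then writes it in ascending order to the mirrored tile position in the output. On compute 1.0 and 1.1 hardware the writes would only be fully coalesced when `n` is a multiple of `TILE`, since otherwise the output tiles start at unaligned addresses. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // assumed block size; one shared-memory slot per thread

__global__ void reverse_staged(const int* in, int* out, int n)
{
    __shared__ int tile[TILE];

    int t = threadIdx.x;
    int src = blockIdx.x * TILE + t;

    // Read in ascending order; reverse while storing to shared memory.
    if (src < n) {
        tile[TILE - 1 - t] = in[src];
    }
    __syncthreads();

    // tile[t] holds in[blockStart + TILE - 1 - t], whose reversed position
    // is n - (blockIdx.x + 1) * TILE + t, so thread t writes ascending
    // addresses. Slots with dst < 0 were never filled (partial last tile).
    int dst = n - (blockIdx.x + 1) * TILE + t;
    if (dst >= 0) {
        out[dst] = tile[t];
    }
}

// Hypothetical host-side wrapper around the kernel above.
void reverse_staged_on_device(const int* h_in, int* h_out, int n)
{
    if (n <= 0) return;
    const int blocks = (n + TILE - 1) / TILE;
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    reverse_staged<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

The reversed index `TILE - 1 - t` in shared memory still gives each thread of a half-warp its own bank, so the staging step should not introduce bank conflicts.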