Hi,
I have a kernel that reverses the elements of an array, launched with multiple blocks.
The tables below show the access pattern for the first 16 threads:
Reading from global memory:
Thread ID    Value accessed from global memory
Thread 0     a[0]
Thread 1     a[1]
Thread 2     a[2]
Thread 3     a[3]
Thread 4     a[4]
Thread 5     a[5]
Thread 6     a[6]
...          ...
Thread 15    a[15]
Writing to global memory:
Thread ID    Value written to global memory
Thread 0     a[15]
Thread 1     a[14]
Thread 2     a[13]
...          ...
Thread 12    a[3]
Thread 13    a[2]
Thread 14    a[1]
Thread 15    a[0]
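For reference, here is a minimal sketch of the access pattern above. It is not my exact kernel: the names (reverseArray, d_in, d_out), the separate input and output buffers, and the sizes are assumptions made only to illustrate the indexing.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel name and signature, for illustration only.
// Assumes separate input and output buffers.
__global__ void reverseArray(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        int v = in[i];       // read:  thread i loads in[i], ascending addresses
        out[n - 1 - i] = v;  // write: thread i stores to out[n-1-i], descending addresses
    }
}

int main()
{
    const int n = 1024;               // assumed array size
    const int threadsPerBlock = 256;  // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const size_t bytes = n * sizeof(int);

    int *h_in = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    reverseArray<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Check on the host that the output is the reversed input.
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[n - 1 - i]) ++errors;
    }
    printf("%s\n", errors == 0 ? "reversed OK" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return errors != 0;
}
```

Within each group of 16 threads, the loads go to ascending addresses and the stores go to the same-sized contiguous range in descending order, which is the write pattern in the second table.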
Can anybody confirm whether these memory writes are non-coalesced? When I profile the kernel, it reports many incoherent stores, even though the threads are still writing to consecutive memory locations, just in reverse order.
Thanks