I would like to understand some things I couldn’t find in the manuals.
For example, suppose the 256 threads of a block each read their own 256-element array (a “private” array per thread):
a) The array is in Local Memory: I understand that each thread reads its element in parallel with the other 15 threads of the same half-warp (is this true for Local Memory accesses?).
b) The arrays (one per thread) are in Global Memory (perfect coalescing): You have to do the same number of reads as with Local Memory. You read 16 elements in parallel (one half-warp).
So, if the latency is the same (the manuals say so), is the performance the same too?
Then why do people prefer to use global memory?
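
To make case (b) concrete, here is a rough host-side C++ sketch of the addressing I have in mind, not real device code. It counts how many memory segments one half-warp touches when its 16 threads all read element i of their private arrays. Everything in it is my own assumption for illustration: 4-byte elements, 64-byte aligned segments, and the hypothetical names elementIndex and segmentsTouched. It is a simplified model, not the exact hardware rule.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed parameters for this sketch (illustrative, not taken from the manuals).
constexpr std::size_t kThreads = 256;      // threads per block
constexpr std::size_t kElems = 256;        // elements in each private array
constexpr std::size_t kHalfWarp = 16;      // threads per half-warp
constexpr std::size_t kElemBytes = 4;      // assumed element size (e.g. float)
constexpr std::size_t kSegmentBytes = 64;  // assumed aligned segment size

enum class Layout { Contiguous, Interleaved };

// Position of element i of a thread's private array inside one big buffer.
// Contiguous:  each thread's 256 elements sit next to each other.
// Interleaved: element i of all 256 threads sits next to each other.
std::size_t elementIndex(Layout layout, std::size_t thread, std::size_t i) {
    return layout == Layout::Contiguous ? thread * kElems + i
                                        : i * kThreads + thread;
}

// Number of distinct aligned segments touched when every thread of the
// given half-warp reads element i of its own array.
std::size_t segmentsTouched(Layout layout, std::size_t halfWarp, std::size_t i) {
    assert(halfWarp < kThreads / kHalfWarp && i < kElems);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kHalfWarp; ++lane) {
        std::size_t thread = halfWarp * kHalfWarp + lane;
        std::size_t byteAddr = elementIndex(layout, thread, i) * kElemBytes;
        segments.insert(byteAddr / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, segmentsTouched(Layout::Contiguous, 0, 0) gives 16 and segmentsTouched(Layout::Interleaved, 0, 0) gives 1, so the “perfect coalescing” in (b) only holds if the per-thread arrays are stored interleaved.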