I have a global array called ‘array_global’ from which I load data into shared memory. The access pattern is as follows:
Threads (0, 1, 2, 3, …, 31) load data from array locations (n, n+2, n+4, n+6, …, n+62)
“n” is the starting index corresponding to thread 0. Would this access pattern be coalesced? Would it still be coalesced if it sits inside a conditional statement that does not evaluate identically across all threads in the block? In general, are accesses inside such divergent conditionals coalesced?
I have another global array called ‘array_global2’ into which I transfer data from shared memory. The access pattern is more complicated, and it also sits inside a conditional statement that does not evaluate identically across the block. The access pattern is:
Threads (0, 1, 2, 3, …, 31) each write to eight locations:
Thread 0 writes to (n, n+1, j, j+1, n+K, n+1+K, j+K, j+1+K) in array_global2
Thread 1 writes to (n+2, n+3, j+2, j+3, n+2+K, n+3+K, j+2+K, j+3+K) in array_global2
And so on up to thread 31. Are these accesses coalesced?
Thanks for your reply. I am unclear why I have a bank conflict while storing data into shared memory. I have 32 threads, and a memory request for a warp is split into two half-warp requests of 16 threads each. Each of those threads accesses a different shared memory location, so why would there be a bank conflict?
I am still unclear about whether conditional statements affect coalescing.
I don’t think you’ve given enough information to diagnose a bank conflict. There is nothing here about the shared memory access pattern.
For coalescing - first we need to know your compute capability and the size of the data you are loading. Without that, all we can really do is link you to the relevant bits of the docs, which I assume you’ve read already.
Conditional statements are fun. If the code diverges (i.e. two threads within a warp follow different paths), then I believe the two paths serialise: threads down one path can have coalesced reads, and threads down the other can too, but the two groups cannot coalesce with each other.
This is interesting actually. Since I believe that loads are coalesced on half-warp boundaries, does anyone know if it is possible to achieve full bandwidth with a warp that has been split right down the middle?
The ith thread accesses the ith location in the shared memory array - a simple one-to-one pattern.
My compute capability is 1.3 - a GTX 280. I am loading either floats or doubles - is that what you meant by size of data? I am not loading any user-defined data types.
Given that you are CC 1.3, LS Chien was correct. Your float access will coalesce into one 128-byte read per half-warp (50% efficiency), and your double access will coalesce into two 128-byte reads per half-warp (again, 50% efficiency). This assumes that n is a multiple of 16; if it’s not, you’ll be worse off.
Your second access pattern seems to be essentially the same. The addresses you write to aren’t conditional, so you should be able to compute them outside the (potentially) divergent region. It might depend on the code.