questions about coalescing access

Hi,

  1. I have a global array called ‘array_global’ from which I load data into shared memory. The access pattern is as follows:

Threads (0,1,2,3,…,31) load data from array locations (n, n+2, n+4, n+6, …, n+62), i.e. thread i reads location n + 2*i.

“n” is the starting index corresponding to thread 0. Would this access pattern be coalesced? Would it still be coalesced if it were inside a conditional statement that does not evaluate identically across all threads in the block? In general, are accesses inside conditional statements that diverge within a block coalesced?
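In case it helps, here is a minimal sketch of what I mean (the names and the one-warp-per-block assumption are just for illustration):

    // Sketch of the stride-2 load into shared memory described above.
    // Assumes a single warp of 32 threads per block; names are made up.
    __global__ void loadStride2(const float *array_global, int n)
    {
        __shared__ float smem[32];                // one element per thread
        int tid = threadIdx.x;                    // 0..31
        smem[tid] = array_global[n + 2 * tid];    // thread i reads location n + 2*i
        __syncthreads();
        // ... smem is then used by the rest of the kernel ...
    }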

  2. I have another global array called ‘array_global2’ into which I transfer data from shared memory. The access pattern is more complicated, and it is also inside a conditional statement which does not evaluate identically across the block. The access pattern is:

Threads (0,1,2,3,…,31) each write to eight locations. The eight locations are:

Thread 0 writes to (n, n+1, j, j+1, n+K, n+1+K, j+K, j+1+K) in array_global2
Thread 1 writes to (n+2, n+3, j+2, j+3, n+2+K, n+3+K, j+2+K, j+3+K) in array_global2

And so on up to thread 31. Are these accesses coalesced?
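Again, a rough sketch of the write pattern, with j and K standing in for whatever offsets my real code computes (names are made up):

    // Sketch of the eight scattered writes per thread.
    // Assumes one warp of 32 threads; j and K are hypothetical offsets.
    __global__ void scatterWrites(float *array_global2, int n, int j, int K)
    {
        __shared__ float smem[32];
        int tid = threadIdx.x;                 // 0..31
        smem[tid] = (float)tid;                // placeholder for the data held in shared memory
        __syncthreads();

        float v = smem[tid];
        int a = n + 2 * tid;                   // the (n, n+1) pair for this thread
        int b = j + 2 * tid;                   // the (j, j+1) pair for this thread
        array_global2[a]     = v;  array_global2[a + 1]     = v;
        array_global2[b]     = v;  array_global2[b + 1]     = v;
        array_global2[a + K] = v;  array_global2[a + 1 + K] = v;
        array_global2[b + K] = v;  array_global2[b + 1 + K] = v;
    }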

Thanks.

-Nachiket

Please see section 3.2.1 of NVIDIA_CUDA_BestPracticesGuide_2.3.pdf.

Suppose a segment is aligned to 128 bytes (the data type is float), and the starting address array_global[n] of the following code is a multiple of 128 bytes:

Threads (0,1,2,3,...,31) load data from array locations (n, n+2, n+4, n+6, ..., n+62)

Then the first half warp (threads 0,1,…,15) loads from (n, n+2, n+4, n+6, …, n+30), which is merged into one 128-byte transaction, and the second half warp also issues a 128-byte transaction. So two 128-byte transactions are issued, but you only use 128 bytes of them, since the stride of the array is 2. In other words, your bandwidth is at most half of the maximum effective bandwidth.
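To make the arithmetic concrete (a sketch only, assuming 4-byte floats and n a multiple of 32 so that array_global[n] sits on a 128-byte boundary):

    first half warp  (threads 0..15)  reads n, n+2, ..., n+30     -> one 128-byte transaction
    second half warp (threads 16..31) reads n+32, n+34, ..., n+62 -> one 128-byte transaction
    bytes transferred:   2 x 128 = 256
    bytes actually used: 32 threads x 4 bytes = 128
    effective bandwidth: 128 / 256 = 50% of peak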

Moreover, you may also have a bank-conflict problem when storing the data into shared memory.

Please see section 5.1.2.5 of NVIDIA_CUDA_Programming_Guide_2.3.pdf.

It depends on which shared memory locations you write to.
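For example (a hypothetical illustration, not your code), on compute 1.x hardware there are 16 banks and conflicts are resolved per half warp:

    __shared__ float smem[512];
    float x = 1.0f;                  // dummy value
    smem[threadIdx.x] = x;           // stride 1: each thread of a half warp hits a different bank -> no conflict
    smem[2 * threadIdx.x] = x;       // stride 2: threads t and t+8 hit the same bank -> 2-way conflict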

Thanks for your reply. I am unclear on why I would have a bank conflict while storing data into shared memory. I have 32 threads, and a memory access request for a warp is split into two half-warp requests of 16 threads each. Each of those threads accesses a different shared memory location, so why would there be a bank conflict?

I am still unclear about whether conditional statements affect coalescing.

Thanks again,

-Nachiket

I don’t think you’ve given enough information to diagnose a bank conflict. There is nothing here about the shared memory access pattern.

For coalescing - first we need to know your compute capability and the size of the data you are loading. Without that all we can really do is link you to the bits in the docs, which I assume you’ve read already.

Conditional statements are fun. If the code diverges on one (i.e. two threads within a warp follow different paths), then I believe the two paths serialise: threads down one path can have coalesced reads, and threads down the other can also have coalesced reads, but they can’t coalesce with each other.
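Something like this, say (a made-up sketch, not anyone's actual code):

    // Both branches issue coalesced accesses on their own, but if pivot splits
    // a warp the two paths are executed as separate passes over that warp.
    __global__ void divergentCopy(const float *in, float *out, int pivot)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x < pivot)
            out[i] = in[i];           // coalesced within this path's threads
        else
            out[i] = 2.0f * in[i];    // coalesced within this path's threads too
    }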

This is interesting actually. Since I believe that loads are coalesced on half-warp boundaries, does anyone know if it is possible to achieve full bandwidth with a warp that has been split right down the middle?

I would have thought so. That said, I’m not too knowledgeable about the hardware so couldn’t say for certain.

My impression is that this doesn’t coalesce at all on compute 1.1 hardware.

Hi,

My shared memory access pattern is as follows:

  1. The ith thread accesses the ith location in the shared memory array - a simple one-to-one pattern (a one-line sketch is below, after this list).

  2. My compute capability is 1.3 - a GTX 280. I am loading either floats or doubles - is that what you meant by size of data? I am not loading any user-defined data types.
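Just to be explicit about item 1, the shared memory store is simply this (names as before are made up):

    __shared__ float smem[32];
    smem[threadIdx.x] = array_global[n + 2 * threadIdx.x];   // thread i -> shared location i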

Thanks,

-Nachiket

Okies - shared memory is fine then.

Given that you are CC 1.3, LS Chien was correct. Your float access will coalesce into one 128-byte read per half warp (50% efficiency), and your double access will coalesce into two 128-byte reads per half warp (again, 50% efficiency). This is assuming that n is a multiple of 16. If it’s not, then you’ll be worse off.
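Rough arithmetic for the double case (my sketch, assuming 8-byte doubles and n a multiple of 16):

    half warp: 16 threads at stride 2 x 8 bytes -> touches a 248-byte range
    on CC 1.3 that is serviced as two 128-byte transactions = 256 bytes
    bytes actually used: 16 x 8 = 128 -> again 50% efficiency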

Your second access pattern seems to be exactly the same. The addresses you write to aren’t conditional, so you should be able to do those writes outside of the (potentially) divergent area. Might depend on the code.
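Something along these lines, maybe (a sketch only - whether it is valid depends on what the inactive threads are allowed to write, which I can't tell from here; limit, smem, n and array_global2 are the hypothetical names from the earlier sketches):

    // Compute the value inside the divergent branch, but issue the store
    // unconditionally so every thread of the warp participates in it.
    float v = 0.0f;                          // whatever the inactive threads should write
    if (threadIdx.x < limit)                 // hypothetical divergent condition
        v = smem[threadIdx.x];
    array_global2[n + 2 * threadIdx.x] = v;  // executed by the whole warp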