questions about coalescing access

Hi,

  1. I have a global array called ‘array_global’ from which I load data into shared memory. The access pattern is as follows:

Threads (0,1,2,3,…,31) load data from array locations (n, n+2, n+4, n+6, …, n+62), i.e. thread i reads location n + 2*i.

“n” is the starting index corresponding to thread 0. Would this access pattern be coalesced? Would it still be coalesced if it were inside a conditional statement that does not evaluate identically across all threads in the block? In general, are accesses inside conditional statements that diverge within a block coalesced?
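In case it helps, here is a minimal sketch of what I mean (the names and the one-warp-per-block assumption are just for illustration):

    // Sketch of the stride-2 load into shared memory described above.
    // Assumes a single warp of 32 threads per block; names are made up.
    __global__ void loadStride2(const float *array_global, int n)
    {
        __shared__ float smem[32];                // one element per thread
        int tid = threadIdx.x;                    // 0..31
        smem[tid] = array_global[n + 2 * tid];    // thread i reads location n + 2*i
        __syncthreads();
        // ... smem is then used by the rest of the kernel ...
    }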

  2. I have another global array called ‘array_global2’ into which I transfer data from shared memory. The access pattern is more complicated, and it is also inside a conditional statement which does not evaluate identically across the block. The access pattern is:

Threads (0,1,2,3,…,31) each write to eight locations. The eight locations are:

Thread 0 writes to (n, n+1, j, j+1, n+K, n+1+K, j+K, j+1+K) in array_global2
Thread 1 writes to (n+2, n+3, j+2, j+3, n+2+K, n+3+K, j+2+K, j+3+K) in array_global2

And so on up to thread 31. Are these accesses coalesced?
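Again, a rough sketch of the write pattern, with j and K standing in for whatever offsets my real code computes (names are made up):

    // Sketch of the eight scattered writes per thread.
    // Assumes one warp of 32 threads; j and K are hypothetical offsets.
    __global__ void scatterWrites(float *array_global2, int n, int j, int K)
    {
        __shared__ float smem[32];
        int tid = threadIdx.x;                 // 0..31
        smem[tid] = (float)tid;                // placeholder for the data held in shared memory
        __syncthreads();

        float v = smem[tid];
        int a = n + 2 * tid;                   // the (n, n+1) pair for this thread
        int b = j + 2 * tid;                   // the (j, j+1) pair for this thread
        array_global2[a]     = v;  array_global2[a + 1]     = v;
        array_global2[b]     = v;  array_global2[b + 1]     = v;
        array_global2[a + K] = v;  array_global2[a + 1 + K] = v;
        array_global2[b + K] = v;  array_global2[b + 1 + K] = v;
    }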

Thanks.

-Nachiket

Please see section 3.2.1 of NVIDIA_CUDA_BestPracticesGuide_2.3.pdf.

Suppose a segment is aligned to 128 bytes (the data type is float), and the starting address array_global[n] of the following code is a multiple of 128 bytes:

Threads (0,1,2,3,...,31) load data from array locations (n, n+2, n+4, n+6, ..., n+62)

Then the first half warp (threads 0,1,…,15) loads from (n, n+2, n+4, n+6, …, n+30), which is merged into one 128-byte transaction, and the second half warp also issues a 128-byte transaction. So two 128-byte transactions are issued, but you only use 128 bytes of them, since the stride of the array is 2. In other words, your bandwidth is at most half of the maximum effective bandwidth.
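To make the arithmetic concrete (a sketch only, assuming 4-byte floats and n a multiple of 32 so that array_global[n] sits on a 128-byte boundary):

    first half warp  (threads 0..15)  reads n, n+2, ..., n+30     -> one 128-byte transaction
    second half warp (threads 16..31) reads n+32, n+34, ..., n+62 -> one 128-byte transaction
    bytes transferred:   2 x 128 = 256
    bytes actually used: 32 threads x 4 bytes = 128
    effective bandwidth: 128 / 256 = 50% of peak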

Moreover, you may also have a bank-conflict problem when storing the data into shared memory.

Please see section 5.1.2.5 of NVIDIA_CUDA_Programming_Guide_2.3.pdf.

It depends on which shared memory locations you write to.
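For example (a hypothetical illustration, not your code), on compute 1.x hardware there are 16 banks and conflicts are resolved per half warp:

    __shared__ float smem[512];
    float x = 1.0f;                  // dummy value
    smem[threadIdx.x] = x;           // stride 1: each thread of a half warp hits a different bank -> no conflict
    smem[2 * threadIdx.x] = x;       // stride 2: threads t and t+8 hit the same bank -> 2-way conflict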

Thanks for your reply. I am unclear on why I would have a bank conflict while storing data into shared memory. I have 32 threads, and a memory access request for a warp is split into two half-warp requests of 16 threads each. Each of those threads accesses a different shared memory location, so why would there be a bank conflict?

I am still unclear about whether conditional statements affect coalescing.

Thanks again,

-Nachiket

I don’t think you’ve given enough information to diagnose a bank conflict. There is nothing here about the shared memory access pattern.

For coalescing - first we need to know your compute capability and the size of the data you are loading. Without that all we can really do is link you to the bits in the docs, which I assume you’ve read already.

Conditional statements are fun. If the code diverges on one (i.e. two threads within a warp follow different paths), then I believe the two paths serialise: threads down one path can have coalesced reads, and threads down the other can also have coalesced reads, but they can’t coalesce with each other.
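Something like this, say (a made-up sketch, not anyone's actual code):

    // Both branches issue coalesced accesses on their own, but if pivot splits
    // a warp the two paths are executed as separate passes over that warp.
    __global__ void divergentCopy(const float *in, float *out, int pivot)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x < pivot)
            out[i] = in[i];           // coalesced within this path's threads
        else
            out[i] = 2.0f * in[i];    // coalesced within this path's threads too
    }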

This is interesting actually. Since I believe that loads are coalesced on half-warp boundaries, does anyone know if it is possible to achieve full bandwidth with a warp that has been split right down the middle?

I would have thought so. That said, I’m not too knowledgeable about the hardware so couldn’t say for certain.

My impression is that this doesn’t coalesce at all on compute 1.1 hardware.

Hi,

My shared memory access pattern is as follows:

  1. The ith thread accesses the ith location in the shared memory array - a simple one-to-one pattern (a one-line sketch is below, after this list).

  2. My compute capability is 1.3 - a GTX 280. I am loading either floats or doubles - is that what you meant by size of data? I am not loading any user-defined data types.
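Just to be explicit about item 1, the shared memory store is simply this (names as before are made up):

    __shared__ float smem[32];
    smem[threadIdx.x] = array_global[n + 2 * threadIdx.x];   // thread i -> shared location i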

Thanks,

-Nachiket

Okies - shared memory is fine then.

Given that you are CC 1.3, LS Chien was correct. Your float access will coalesce into one 128-byte read per half warp (50% efficiency), and your double access will coalesce into two 128-byte reads per half warp (again, 50% efficiency). This is assuming that n is a multiple of 16. If it’s not, then you’ll be worse off.
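Rough arithmetic for the double case (my sketch, assuming 8-byte doubles and n a multiple of 16):

    half warp: 16 threads at stride 2 x 8 bytes -> touches a 248-byte range
    on CC 1.3 that is serviced as two 128-byte transactions = 256 bytes
    bytes actually used: 16 x 8 = 128 -> again 50% efficiency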

Your second access pattern seems to be exactly the same. The addresses you write to aren’t conditional, so you should be able to do those writes outside of the (potentially) divergent area. Might depend on the code.
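Something along these lines, maybe (a sketch only - whether it is valid depends on what the inactive threads are allowed to write, which I can't tell from here; limit, smem, n and array_global2 are the hypothetical names from the earlier sketches):

    // Compute the value inside the divergent branch, but issue the store
    // unconditionally so every thread of the warp participates in it.
    float v = 0.0f;                          // whatever the inactive threads should write
    if (threadIdx.x < limit)                 // hypothetical divergent condition
        v = smem[threadIdx.x];
    array_global2[n + 2 * threadIdx.x] = v;  // executed by the whole warp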