“For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.”
I’m wondering what the criteria are for memory accesses to be grouped.
I get that float4 writes to consecutive locations probably all get grouped, but what about something like this:
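For example, something like the following, where the two branches each issue their own store (the condition and values here are just stand-ins):

```
if (threadIdx.x & 1) {    // line 1: some divergent condition
    out[idx] = a;         // line 2: store issued by the odd threads
} else {                  // line 3
    out[idx] = b;         // line 4: store issued by the even threads
}
```

Every thread writes one element of out, but half the warp does it from the store on line 2 and the other half from the store on line 4.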
I get that the compiler is probably smart enough to work around this example, but as this kind of thing gets more complex, is there a general rule for delaying writes to see if they all coalesce? I could see a warp committing all writes to memory that don’t depend on outstanding reads and then anticipating a stall while they all go out. Does anyone know if it works something like this?
Memory operations issued from separate instructions never coalesce with each other. Coalescing is a behavior that applies to memory activity emanating from a single instruction.
I think it’s unlikely that the compiler has an idiom to take the code you have written and do any sort of coalescing between the stores you are issuing from line 2 and from line 4. However, rather than speculate, such a question could be quickly answered with some SASS analysis (e.g. compiling the kernel and dumping the machine code with cuobjdump -sass). But to your point, as the if/else construct gets more complex, I think it’s even less likely that the compiler would have an idiom to reduce it to a single instruction. Unless it is reduced to a single instruction, coalescing (across 2 or more instructions) is not possible.
I’m using “coalescing” here in what I consider the formal sense of the word in CUDA. There is also the concept of a coalescing buffer, which I find to be loosely defined, applicable only to writes, and not that much different from the general understanding of a cache (line). The (L2) cache can act as a “coalescing buffer” by collecting write activity from multiple instructions before it is written out to DRAM in, presumably, a minimized set of transactions. This is possible in part because the L2 has write-back, not write-through, behavior. And I would be careful to dissociate this usage of “coalescing buffer” from the formal use of “coalescing”, which IMO is well defined.
Interesting, good to know. I’ve definitely written code declaring a variable outside the if in cases like these so that all the writes happen together afterward; good to know that was warranted.
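i.e., something like this (again with placeholder names and condition):

```
float4 val;                        // declared outside the divergent region
if (threadIdx.x & 1) {
    val = make_float4(a, a, a, a);
} else {
    val = make_float4(b, b, b, b);
}
out[idx] = val;                    // the whole warp stores from this one instruction
```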
And I guess it’s not as impactful as it would seem, with a large write like st.global.v4.f64 maxing out the current coalescing transaction size of 32 bytes anyway.
And for a coalescing buffer, the only tools really available to sculpt this are shuffling threadfences around, being careful about volatile, and hoping for the best, right?
The maximum transaction size per thread is 16 bytes (per instruction). This is covered in the programming guide.
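For illustration, a float4 access is the widest a single instruction can go per thread (a throwaway kernel):

```
// Each thread moves 16 bytes per instruction via float4, the per-thread
// maximum; the warp's 512 bytes are then serviced as 32-byte transactions.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // one 128-bit load, one 128-bit store
}
```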
I don’t have any good abstract advice for how to “manage” a “coalescing buffer”, and I’m not really sure I understand your questions anyway.
IMO cache blocking optimizations are one of the last things I would look at on a GPU, and optimizing for something like a cacheline seems rather difficult. But good luck!