“For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.”
I’m wondering what the criteria are for memory accesses to be grouped.
I get that float4 writes to consecutive locations probably all get grouped, but what about something like this:
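For example, something like the following, where the two branches each issue their own store (the condition and values here are just stand-ins):

```
if (threadIdx.x & 1) {    // line 1: some divergent condition
    out[idx] = a;         // line 2: store issued by the odd threads
} else {                  // line 3
    out[idx] = b;         // line 4: store issued by the even threads
}
```

Every thread writes one element of out, but half the warp does it from the store on line 2 and the other half from the store on line 4.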
I get that the compiler is probably smart enough to work around this example, but as this kind of thing gets more complex, is there a general rule for delaying writes to see if they all coalesce? I could see a warp committing all writes to memory that don’t depend on outstanding reads and then anticipating a stall while they all go out. Does anyone know if it works something like this?
Memory operations issued from separate instructions never coalesce with each other. Coalescing is a behavior that applies to memory activity emanating from a single instruction.
I think it’s unlikely that the compiler has an idiom to take the code you have written and do any sort of coalescing between the stores you are issuing from line 2 and from line 4. However, rather than speculate, such a question could be quickly answered with some SASS analysis (e.g. compiling the kernel and dumping the machine code with cuobjdump -sass). But to your point, as the if/else construct gets more complex, I think it’s even less likely that the compiler would have an idiom to reduce it to a single instruction. Unless it is reduced to a single instruction, coalescing (across 2 or more instructions) is not possible.
I’m using “coalescing” here in what I consider the formal sense of the word in CUDA. There is also the concept of a coalescing buffer, which I find to be loosely defined, applicable only to writes, and not that much different from the general understanding of a cache (line). The (L2) cache can act as a “coalescing buffer” by collecting write activity from multiple instructions before it is written out to DRAM in, presumably, a minimized set of transactions. This is possible in part because the L2 has write-back, not write-through, behavior. And I would be careful to dissociate this usage of “coalescing buffer” from the formal use of “coalescing”, which IMO is well defined.
Interesting, good to know. I’ve definitely written code declaring a variable outside the if in cases like these so that all the writes happen together afterward; good to know that was warranted.
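i.e., something like this (again with placeholder names and condition):

```
float4 val;                        // declared outside the divergent region
if (threadIdx.x & 1) {
    val = make_float4(a, a, a, a);
} else {
    val = make_float4(b, b, b, b);
}
out[idx] = val;                    // the whole warp stores from this one instruction
```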
And I guess it’s not as impactful as it would seem, with a large write like st.global.v4.f64 maxing out the current coalescing transaction size of 32 bytes anyway.
And for a coalescing buffer, the only tools really available to sculpt this are shuffling threadfences around, being careful about volatile, and hoping for the best, right?
The maximum transaction size per thread is 16 bytes (per instruction). This is covered in the programming guide.
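For illustration, a float4 access is the widest a single instruction can go per thread (a throwaway kernel):

```
// Each thread moves 16 bytes per instruction via float4, the per-thread
// maximum; the warp's 512 bytes are then serviced as 32-byte transactions.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // one 128-bit load, one 128-bit store
}
```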
I don’t have any good abstract advice for how to “manage” a “coalescing buffer”, and I’m not really sure I understand your questions anyway.
IMO cache blocking optimizations are one of the last things I would look at on a GPU, and optimizing for something like a cacheline seems rather difficult. But good luck!