Require clarification on memory coalescing


Assumption - for all the questions below, I assume a 32-bit word size (data bus size); the data to be operated on is 16-bit or 32-bit.

Memory alignment granularity - 32 bytes or 64 bytes.

Question 1 -

Assumption for Question 1 - the data being operated on is 16-bit, the word size (data bus size) is 32 bits, and the memory alignment granularity is 64 bytes.

In the CUDA Handbook it is mentioned that “Reading or writing 16-bit words is always uncoalesced.” I am not able to figure out why.

I believe memory coalescing can be achieved with 16-bit data.

My thought process -

16 bits is 2 bytes. Suppose the memory alignment granularity on the GPU is 64 bytes. Then 32 variables of size 2 bytes (16 bits) fit perfectly within one aligned 64-byte block, so successive threads can read data from successive memory locations. For example, thread_0 reads from the nth memory location, thread_1 from the (n + sizeof(variable_being_operated))th location, and so on, where sizeof(variable_being_operated) = 16 bits = 2 bytes. So it seems to me that with 64-byte alignment, a 32-bit word size, and 16-bit data, coalescing can be achieved.
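
To make that concrete, here is a minimal sketch of the access pattern I have in mind (the kernel name and details are my own, not from the book): each thread reads one 16-bit element, so a warp of 32 threads covers 64 contiguous bytes.

```cpp
__global__ void copy16(const short *in, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = in[i];  // 2-byte load: thread i touches byte offset 2*i,
                         // so one warp reads 32 * 2 = 64 contiguous bytes
}
```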

Obviously I am wrong, because the CUDA Handbook says coalescing cannot be achieved for 16-bit data. What am I missing in my understanding?


Question 2 -

Assumption for Question 2 - the data being operated on is 32-bit, the word size (data bus size) is 32 bits, and the memory alignment granularity is 32 bytes.

In the CUDA Handbook, under the topic of coalescing constraints, it is mentioned that "for successful coalescing, for a 32 bit word, the Memory Alignment criteria should be 64 bytes." I am not able to figure out why.

But I believe coalescing can be achieved for 32-bit data with 32-byte memory alignment.

My thought process - assume the memory alignment granularity is 32 bytes and the data being operated on is 32 bits (4 bytes). Eight 4-byte variables fit perfectly within one aligned 32-byte block, so successive threads can read data from successive memory locations. For example, thread_0 reads from the nth memory location, thread_1 from the (n + sizeof(variable_being_operated))th location, and so on, where sizeof(variable_being_operated) = 32 bits = 4 bytes. So it seems to me that with 32-byte alignment, a 32-bit word size, and 32-bit data, coalescing can be achieved.
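
Again as a sketch (my own example, not from the book): each thread reads one 4-byte element, so eight consecutive threads cover one 32-byte block and a full warp covers 128 contiguous bytes.

```cpp
__global__ void copy32(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = in[i];  // 4-byte load: 8 threads span one 32-byte block,
                         // a full warp spans 32 * 4 = 128 contiguous bytes
}
```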

Obviously I am wrong, because the CUDA Handbook says that to achieve coalescing with 32-bit variables, the alignment needs to be 64 bytes. What am I missing in my understanding?


I think the book you are relying on may be espousing somewhat dated information, from the Kepler CC 3.0 era, and things have changed somewhat since then.

Quoting from the “Best Practices Guide”:

"The access requirements for coalescing depend on the compute capability of the device and are documented in the CUDA C++ Programming Guide.

For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp."
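
As a worked example of that rule (my own arithmetic, not from the guide): if each thread of a warp loads one adjacent 4-byte float, the warp requests 32 × 4 = 128 bytes. With the base address on a 32-byte boundary, that is exactly four 32-byte transactions (128 bytes requested / 128 bytes transferred = 100% bus utilization). Shift the base by 4 bytes and the same request straddles five 32-byte segments (128/160 = 80%). A hypothetical kernel for observing this in a profiler:

```cpp
// 'offset' shifts the base address of the warp's access in 4-byte steps;
// assuming 'in' is itself 32-byte aligned, any offset that is not a
// multiple of 8 floats (32 bytes) breaks 32-byte segment alignment and
// adds one transaction per warp
__global__ void shiftedCopy(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset];
}
```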


I concur with rs277.

I don’t have that book, and Nick Wilt is undoubtedly not wrong as far as the scope of his material goes. However, Nick left NVIDIA in 2010, which was the Fermi era as far as the world was concerned; Kepler would have been under NDA at that point. I’m not sure exactly when the book was written, but it seems likely that he had only (Tesla and) Fermi and Kepler in view.

At any rate what follows is just conjecture. I don’t have the book, and don’t know what he meant precisely. If you study the context carefully, you may be able to do a better job of conjecture.

Q1:

I think perfect coalescing here might translate to (bytes requested)/(bytes retrieved) = 1.0, considered on a request-by-request basis.

The L1 cacheline in both Kepler and Fermi was 128 bytes wide. A miss would trigger a load of 128 bytes, unless the L1 cache was disabled (it could be disabled under programmer control, and Kepler had it disabled by default for certain types of loads). With the L1 enabled, then, a warp-wide miss would trigger a 128-byte load. If you only need 2 bytes per thread (16-bit words), you are loading 128 bytes while only “needing” 32 × 2 = 64. That corresponds to 50% efficiency. Such a narrow view might be what was being communicated, although if you have adjacent warps reading adjacent data, the actual efficiency observed across many requests will be much closer to 1.0. That is in fact a key purpose of the L1.

Later architectures also had a 128-byte L1 cacheline, but it was broken into sectors (four 32-byte sectors, I believe), so even 8-bit loads warp-wide could achieve the 1.0 efficiency number.
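
Spelling out that arithmetic (my addition): a warp-wide 8-bit load requests 32 × 1 = 32 bytes. With 32-byte sectors, a miss fetches only the one sector that is needed, so 32/32 = 1.0, whereas a monolithic 128-byte line would fetch 128 bytes for the same request, giving 32/128 = 0.25.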

Q2:

AFAIK, for Fermi and Kepler, the L1 cacheline had a 128-byte alignment. That means that if you requested 128 bytes warp-wide (e.g. 32 bits per thread, adjacent) but the starting point was not a 128-byte aligned boundary, that request would still translate into 2 transactions, effectively touching 2 different cachelines. Our efficiency here drops to 50% (256 bytes loaded, for 128 bytes requested). Nick Wilt probably knows something I don’t, but if I were teaching CUDA back in those days, I would have pointed out that if we consider a single warp-wide request, the 128-byte alignment matters.

However, just as in my response to Q1, a principal objective of the L1 cache is to help “smooth” such activity, and it was a significant benefit over the Tesla generation (cc 1.x) in this respect. For adjacent warps reading adjacent data, considered over many such requests, the actual efficiency for this case approaches 1.0, courtesy of the L1.

Fermi also had a not-well-publicised “hotclock” architecture which effectively split processing in half, over two “hotclock cycles”: the first cycle would process one half-warp, and the second cycle would process the other half-warp. That might seem to suggest a 64-byte granularity, but at the moment I don’t think I can connect those ideas, so I would just say that for Fermi, based on my conjecture of what Nick Wilt might have meant, the alignment to pay attention to would have been 128-byte alignment, not 64. So obviously I also am “missing something”.
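
One practical footnote (a fact about the CUDA runtime, not from the book): pointers returned by cudaMalloc are aligned to at least 256 bytes, so a freshly allocated buffer always satisfies the 128-byte alignment discussed above; misalignment typically comes from offsetting into a buffer. A minimal sketch:

```cpp
#include <cuda_runtime.h>

int main()
{
    float *buf = nullptr;
    cudaMalloc(&buf, 1024 * sizeof(float)); // buf is >= 256-byte aligned
    float *shifted = buf + 1;  // base now 4 bytes past a 128-byte boundary;
                               // on Fermi/Kepler a warp-wide 128-byte request
                               // through 'shifted' touches two cachelines
    // ... launch kernels that read through 'shifted' ...
    cudaFree(buf);
    return 0;
}
```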

My view is that these weren’t primary considerations even in those days for “bulk” access, and they are even less of a consideration on “modern” architectures. When I was teaching CUDA in those days, I would put up the usual coalescing diagrams and point out that “an unaligned but otherwise adjacent load immediately reduces the efficiency to 50%. Does that matter? It depends. If you are bulk-loading data (a long vector) with reasonable temporal locality, then it doesn’t matter - the efficiency approaches 100% due to the L1. But if that type of request dominates your load pattern (i.e. loading unaligned here, there and everywhere) then it certainly matters.”

Tesla (cc1.x) GPUs were even “more different”, so it’s possible the 64-byte granularity might have applied there. Whatever I knew about that architecture I have mostly forgotten.


@rs277 @Robert_Crovella thank you for the explanation, and yes, the book was written in the context of Fermi and pre-Fermi architectures.

For context, this is what the book says: “The base address of the warp (the address being accessed by the first thread in the warp) must be aligned as shown in Table 5.6.”

Table 5.6 Alignment Criteria for Coalescing

WORD SIZE    ALIGNMENT
8-bit        *
16-bit       *
32-bit       64-byte
64-bit       128-byte
128-bit      256-byte
