I concur with rs277.
I don’t have that book, and Nick Wilt is undoubtedly not wrong, as far as the scope of his material goes. However, Nick left NVIDIA in 2010, which was the Fermi era as far as the world was concerned; Kepler would still have been under NDA at that point. I’m not sure exactly when the book was written, but it seems likely that he had only (Tesla and) Fermi and Kepler in view.
At any rate, what follows is just conjecture. Since I don’t have the book, I don’t know precisely what he meant. If you study the context carefully, you may be able to do a better job of conjecture than I can.
Q1:
I think “perfect coalescing” here might translate to (bytes requested) / (bytes retrieved) = 1.0, considered on a request-by-request basis.
The L1 cacheline in both Fermi and Kepler was 128 bytes wide, and a miss would trigger a fill of the full 128 bytes, unless the L1 cache was disabled (it could be disabled under programmer control, and Kepler had it disabled by default for certain types of loads). With the L1 enabled, then, a warp-wide miss would trigger a 128-byte load. If you only need 2 bytes per thread (16-bit words), you are loading 128 bytes while only “needing” 64. That corresponds to 50% efficiency. Such a narrow view might be what was being communicated, although if you have adjacent warps reading adjacent data, the actual efficiency, considered across many requests, will probably be much closer to 1.0. That is in fact a key purpose of the L1.
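To make that concrete, here is a trivial sketch of the 16-bit-per-thread case (the kernel, names, and sizes are mine, purely for illustration):

```
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void load16(const uint16_t *in, uint16_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread reads 2 bytes, so a full warp requests 32 x 2 = 64 bytes.
        // On Fermi/Kepler with L1 enabled, a miss still fills a 128-byte line:
        // 64 requested / 128 retrieved = 50% efficiency for this one request.
        out[i] = in[i];
    }
}

int main()
{
    const int n = 1 << 20;
    uint16_t *in, *out;
    cudaMalloc(&in, n * sizeof(uint16_t));
    cudaMalloc(&out, n * sizeof(uint16_t));
    load16<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    printf("done\n");
    return 0;
}
```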
Later architectures also had a 128-byte L1 cacheline, but it was broken into sectors (four 32-byte sectors, I believe), so even 8-bit loads warp-wide could achieve the 1.0 efficiency number.
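If it helps, the granularity arithmetic can be captured in a few lines (the helper function is hypothetical, and the 32-byte sector size for later architectures is my assumption):

```
#include <cstdio>

// Hypothetical helper: efficiency of a single, aligned, warp-wide request,
// given bytes per thread and the hardware's fill granularity.
double request_efficiency(int bytes_per_thread, int fill_granularity)
{
    const int warp_size = 32;
    int requested = bytes_per_thread * warp_size;
    // Round the fill up to a whole number of granules.
    int granules  = (requested + fill_granularity - 1) / fill_granularity;
    return (double)requested / (double)(granules * fill_granularity);
}

int main()
{
    // 8-bit loads warp-wide: a whole 128-byte line vs. one 32-byte sector.
    printf("Fermi/Kepler full-line fill: %.2f\n", request_efficiency(1, 128)); // 0.25
    printf("Sectored fill:               %.2f\n", request_efficiency(1, 32));  // 1.00
    return 0;
}
```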
Q2:
AFAIK, for Fermi and Kepler, L1 cachelines were aligned to 128-byte boundaries. That means that if you requested 128 bytes warp-wide (e.g. 32 bits per thread, adjacent) but the starting address was not on a 128-byte boundary, the request would still translate into 2 transactions, effectively touching 2 different cachelines. Nick Wilt probably knows something I don’t, but if I were teaching CUDA back in those days, I would have pointed out that if we consider a single warp-wide request, the 128-byte alignment is what matters. Our efficiency here drops to 50% (256 bytes loaded for 128 bytes requested; there is a sketch of this case below). However, just as in my response to Q1, a principal objective of the L1 cache is to “smooth” such activity, and it was a significant benefit over the Tesla generation (cc1.x) in this respect. For adjacent warps reading adjacent data, considered over many such requests, the actual efficiency for this case approaches 1.0, courtesy of the L1.

Fermi also had a not-well-publicised “hotclock” architecture which effectively split processing in half, over two “hotclock” cycles: the first cycle would process one half-warp, and the second cycle would process the other. That might seem to suggest a 64-byte granularity, but at the moment I can’t connect those ideas, so I would just say that for Fermi, based on my conjecture of what Nick Wilt might have meant, the alignment to pay attention to would have been 128-byte alignment, not 64. So obviously I also am “missing something”.
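Here is the sketch of the misaligned single-request case I mentioned above (the kernel and the one-element offset are mine; cudaMalloc returns generously aligned pointers, so in + 1 deliberately breaks the 128-byte alignment):

```
#include <cuda_runtime.h>

__global__ void copy_offset(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each warp requests 32 x 4 = 128 adjacent bytes, but starting 4 bytes
        // past a 128-byte boundary, so a single request touches two cachelines:
        // 128 requested / 256 retrieved = 50%, considered in isolation.
        out[i] = in[i + 1];
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, (n + 1) * sizeof(float));  // cudaMalloc pointers are 128-byte aligned (and then some)
    cudaMalloc(&out, n * sizeof(float));
    copy_offset<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```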
My view is that these weren’t primary considerations even in those days for “bulk” access, and they are even less of a consideration on modern architectures. When I was teaching CUDA in those days, I would put up the usual coalescing diagrams and point out that “an unaligned but otherwise adjacent load immediately reduces the efficiency to 50%. Does that matter? It depends. If you are bulk-loading data (a long vector) with reasonable temporal locality, then it doesn’t matter; the efficiency approaches 100% due to the L1. But if that type of request dominates your load pattern (i.e. loading unaligned here, there, and everywhere), then it certainly matters.”
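For completeness, the “approaches 100%” claim is just this bit of accounting (my arithmetic, under the assumption that each 128-byte line gets filled exactly once courtesy of the L1):

```
#include <cstdio>

int main()
{
    // Misaligned bulk load of N floats: the warps collectively touch about
    // 4N/128 + 1 distinct 128-byte lines, each filled once if it stays
    // resident in L1 long enough to serve the neighboring warp.
    for (long n = 1 << 10; n <= (1 << 24); n <<= 7) {
        long requested = 4 * n;
        long retrieved = (requested / 128 + 1) * 128;
        printf("N = %8ld floats: efficiency = %.4f\n",
               n, (double)requested / (double)retrieved);
    }
    return 0;
}
```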
Tesla (cc1.x) GPUs were even “more different”, so it’s possible that a 64-byte granularity applied there. Whatever I knew about that architecture I have mostly forgotten.