What is the “transaction” here? I know that the most data each phase can read from global memory is 128B. Does that mean that, however much data we copy to shared memory, the minimum granularity is actually 64B rather than 32B? The slide marks the latter 32B of data as prefetched, but if we prefetch on every load, that is really a larger granularity rather than simply a prefetch.
Does it affect memory coalescing from global memory? For example, suppose the first half of a warp reads a contiguous 64B region A, and the other half of the warp reads another contiguous 64B region B. Does that perform 2 coalesced accesses (or transactions) from global memory?
Do the above two rules apply to the cp.async instruction on newer architectures?
A transaction here represents the minimum quantity of data exchange. If the GPU memory controller decides that data is needed from HBM/DRAM, it will issue the necessary activity to retrieve that data. The minimum size of a transfer is 1 sector (32 bytes) or, based on the GTC talk, perhaps 2 sectors (64 bytes). A sector is a contiguous 32-byte block of memory. You cannot request just 1 byte (for example) from DRAM/HBM. The unit of transfer is the sector.
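To make the sector granularity concrete, here is a minimal sketch in Python. It is a host-side model, not GPU code. It assumes only the 32-byte sector size stated above; the helper name `sectors_touched` is hypothetical, and the sketch takes no position on whether the hardware actually fetches 1 or 2 sectors at a time.

```python
SECTOR_BYTES = 32  # sector size as stated above


def sectors_touched(start_byte, num_bytes):
    """Return the indices of the 32-byte sectors overlapped by a byte range.

    Hypothetical helper for illustration; it models address arithmetic only.
    """
    if num_bytes <= 0:
        return []
    first = start_byte // SECTOR_BYTES
    last = (start_byte + num_bytes - 1) // SECTOR_BYTES
    return list(range(first, last + 1))


# A 1-byte request still costs a whole sector of traffic.
print(len(sectors_touched(100, 1)) * SECTOR_BYTES)  # 32

# A 4-byte request that straddles a sector boundary touches two sectors.
print(sectors_touched(30, 4))  # [0, 1]
```

In this model, the cost of a request is the number of distinct sectors it overlaps, never the number of bytes the code asked for.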
Why not provide a link to the talk?
Based on your description, and assuming that region A is not adjacent to region B, even without any information from any particular GTC talk, that would require at least 2 accesses to memory. It might be important to state that “global memory” formally refers to a logical space, whereas “DRAM” or “HBM” refers to a physical backing/resource.
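The two-region case can be sketched the same way. This is a rough Python model rather than a description of the hardware: it assumes a 32-thread warp, a 32-byte sector, and one 4-byte load per thread (the element size is an assumption, chosen so that 16 lanes cover 64 bytes). The function name `warp_sectors` is hypothetical. It reports only which sectors the warp touches, not how the memory system groups them into requests.

```python
SECTOR_BYTES = 32  # sector size from the discussion above
WARP_SIZE = 32
ELEM_BYTES = 4     # assumption: each thread loads one 4-byte word


def warp_sectors(base_a, base_b):
    """Distinct sectors touched when lanes 0-15 read 64 contiguous bytes
    starting at base_a and lanes 16-31 read 64 contiguous bytes at base_b."""
    half = WARP_SIZE // 2
    sectors = set()
    for lane in range(WARP_SIZE):
        base = base_a if lane < half else base_b
        addr = base + (lane % half) * ELEM_BYTES
        first = addr // SECTOR_BYTES
        last = (addr + ELEM_BYTES - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return sorted(sectors)


# Regions A and B far apart: two separate groups of sectors.
print(warp_sectors(0, 4096))  # [0, 1, 128, 129]

# Regions A and B adjacent: one contiguous run of four sectors.
print(warp_sectors(0, 64))    # [0, 1, 2, 3]
```

When A and B are not adjacent, the touched sectors fall into two disjoint groups, which matches the point above that at least 2 accesses are needed.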
I think a reasonable mental model for cp.async is that with respect to the global space, it creates activity much the same way you would if you were writing code that loaded from the global space. There is no reason to assume that the coalescing behavior or issuing of activity to cover the needed data would be any different.