Issues about async on A100

But a direct load from global memory doesn't require every address to be aligned to 128B; it only has coalescing requirements. For example, if the 32 threads in a warp load 128B of data together, thread i can use an address that is only 4B-aligned and still get the best efficiency. With async copy, however, even a single thread can request 16B. If some threads need to load 128B together, they need to use addresses 0, 16, 32, 48, 64, 80, 96, 112 and accept the loss in efficiency. Do the alignment requirements work the way I describe in this example?
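To make the two access patterns concrete, here is a small host-side C++ sketch of the address arithmetic only. The helper names `thread_offsets` and `all_aligned` are hypothetical, the base address is assumed to be 128B-aligned, and nothing here issues a real copy or says anything about how the hardware actually performs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: byte offsets (relative to an assumed 128B-aligned base)
// touched by each participating thread when every thread moves
// `bytes_per_thread` bytes and together they cover `total_bytes`.
std::vector<std::size_t> thread_offsets(std::size_t total_bytes,
                                        std::size_t bytes_per_thread) {
    assert(bytes_per_thread != 0 && total_bytes % bytes_per_thread == 0);
    std::vector<std::size_t> offsets;
    for (std::size_t off = 0; off < total_bytes; off += bytes_per_thread) {
        offsets.push_back(off);
    }
    return offsets;
}

// Hypothetical helper: true if every offset is a multiple of `alignment`.
bool all_aligned(const std::vector<std::size_t>& offsets,
                 std::size_t alignment) {
    for (std::size_t off : offsets) {
        if (off % alignment != 0) {
            return false;
        }
    }
    return true;
}
```

Calling `thread_offsets(128, 4)` models the direct-load case (32 threads, each 4B-aligned), while `thread_offsets(128, 16)` models the async-copy case from the question (8 threads at 0, 16, ..., 112). `all_aligned` then checks which alignment each pattern satisfies.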

After reading what you wrote, I am no longer so sure myself.