But a direct load from global memory doesn't require every address to be aligned to 128B; it only has coalescing requirements. For example, if the 32 threads of a warp load 128B together, each thread i can use an address that is only 4B-aligned and still get the best efficiency. With async copy, however, a single thread can request 16B. If several threads need to load 128B together, they would have to use addresses 0, 16, 32, 48, 64, 80, 96, 112 and accept the loss in efficiency. Do the alignment requirements work the way I describe in this example?
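To make the example concrete, here is a minimal host-side C++ sketch of the two address patterns I mean. It is only arithmetic on byte addresses, not device code, and it says nothing about what the hardware actually does. The helper names (`lane_offset`, `all_lanes_aligned`, `segments_touched`) are made up for illustration, and the 128B segment size is the assumption from my question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers: host-side address arithmetic only, no CUDA calls.

// Byte offset accessed by `lane` when each lane moves `bytes_per_lane`
// contiguous bytes (lane 0 first, then lane 1, and so on).
std::size_t lane_offset(std::size_t lane, std::size_t bytes_per_lane) {
    return lane * bytes_per_lane;
}

// True if every lane's address is a multiple of its own access size,
// i.e. 4B-aligned for 4B accesses, 16B-aligned for 16B accesses.
bool all_lanes_aligned(std::uintptr_t base, std::size_t lanes,
                       std::size_t bytes_per_lane) {
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        std::uintptr_t addr = base + lane_offset(lane, bytes_per_lane);
        if (addr % bytes_per_lane != 0) {
            return false;
        }
    }
    return true;
}

// Number of distinct 128-byte segments covered by the whole group of lanes.
// Assumes a 128B segment size, as in the example above.
std::size_t segments_touched(std::uintptr_t base, std::size_t lanes,
                             std::size_t bytes_per_lane) {
    const std::uintptr_t segment = 128;
    std::uintptr_t first = base / segment;
    std::uintptr_t last = (base + lanes * bytes_per_lane - 1) / segment;
    return static_cast<std::size_t>(last - first + 1);
}
```

With a base address of 0, `all_lanes_aligned(0, 32, 4)` and `all_lanes_aligned(0, 8, 16)` both hold, and both patterns cover one 128B segment. With a base that is only 4B-aligned, such as 4, the 32 x 4B pattern still satisfies its per-thread alignment while the 8 x 16B pattern does not, which is the difference I am asking about.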