Issues about async on A100

half-0 · March 5, 2025, 9:11am

But a direct load from global memory doesn't require every address to be aligned to 128B; it only has coalescing requirements. For example, if the 32 threads of a warp load 128B of data together, thread i can use an address that is only 4B-aligned and still get the best efficiency. With async copy, however, a single thread can request 16B. If several threads need to load 128B together, they have to use addresses 0, 16, 32, 48, 64, 80, 96, 112 and accept the loss in efficiency. Do the alignment requirements work the way I describe in this example?
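To make the two access patterns in the example concrete, here is a minimal host-side C++ sketch of the address arithmetic only. The helper names (`thread_offset`, `threads_needed`, `covers_segment`) are hypothetical and are not part of any CUDA API. The sketch assumes a 128B-aligned base address and says nothing about how the hardware actually schedules or coalesces the requests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: byte offset (relative to a 128B-aligned base) accessed
// by thread `tid` when every participating thread copies `bytes_per_thread`
// contiguous bytes.
constexpr std::size_t thread_offset(std::size_t tid, std::size_t bytes_per_thread) {
    return tid * bytes_per_thread;
}

// Hypothetical helper: how many threads are needed to cover `segment_bytes`.
constexpr std::size_t threads_needed(std::size_t segment_bytes, std::size_t bytes_per_thread) {
    return segment_bytes / bytes_per_thread;
}

// Returns true if the per-thread accesses are naturally aligned to their own
// size and tile the segment exactly once (no gaps, no overlaps).
bool covers_segment(std::size_t segment_bytes, std::size_t bytes_per_thread) {
    if (bytes_per_thread == 0 || segment_bytes % bytes_per_thread != 0) return false;
    std::vector<bool> touched(segment_bytes, false);
    const std::size_t n = threads_needed(segment_bytes, bytes_per_thread);
    for (std::size_t tid = 0; tid < n; ++tid) {
        const std::size_t off = thread_offset(tid, bytes_per_thread);
        if (off % bytes_per_thread != 0) return false;  // natural alignment check
        for (std::size_t b = 0; b < bytes_per_thread; ++b) {
            if (touched[off + b]) return false;          // overlap
            touched[off + b] = true;
        }
    }
    for (bool t : touched) {
        if (!t) return false;                            // gap
    }
    return true;
}
```

With 4B per thread, 32 threads cover the 128B segment at offsets 0, 4, ..., 124; with 16B per thread, 8 threads cover the same segment at offsets 0, 16, ..., 112. Both patterns touch the same contiguous 128 bytes, so the open question in the post is only about how the hardware treats the two request shapes.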

Curefab · March 5, 2025, 10:27am

After reading your post, I am no longer so sure myself.
