Will async copy be slower than directly loading data from gmem through registers to smem? How many cycles should I assume for an async copy of 16B per thread directly from gmem, from L2 cache, and from L1 cache? Also, does the async copy operation need to be coalesced?
Can I just regard async copy as the same as a direct copy from gmem through registers to shared memory?
async copy, at least for cc 8.0+, should not involve register usage/staging for the copy. It is a mechanism that can copy “directly” from gmem to smem. I have no idea about faster or slower, and since it is async, the number of cycles it might require would almost certainly depend on what else is going on on the GPU or SM, so asking for a cycle count seems impossible to answer (to me).
gmem activity always flows through L2. L2 is a device-wide proxy for gmem, at least from the standpoint of activity emanating from the SMs.
I notice that in the A100 whitepaper, the speed of loading data from gmem through registers to shared memory is given, but the speed of async copy is not. So I just wonder if it is totally unknown and unpredictable?
I don’t know of any specification or documentation for it, and I’ve already indicated why I think it could be variable. I don’t know the bounds of variability.
The GPU is a latency-hiding machine. A principal idea of async copy is that you are giving the GPU something else to do, in the presence of other issued work. So its purpose is (probably - my guess) not to offer the fastest possible path to move data in any scenario, but instead to provide an asynchronous path that allows you to efficiently schedule a copy operation, making efficient use of GPU capacity.
It allows one to create an asynchronous work pipeline, not unlike the overlap of copy and compute that is considered a fundamental CUDA optimization principle, when copying data to/from the GPU.
Another possible value proposition for async copy - that I mentioned already - is lower register usage/pressure.
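As a concrete illustration of that “no register staging” path, here is a minimal sketch using the cooperative-groups `memcpy_async` API (the kernel name, the dynamic shared-memory tile, and the trivial doubling computation are my own assumptions, not from the thread):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: stage one tile of floats into shared memory, then use it.
__global__ void scale_tile(const float* gmem, float* out) {
    extern __shared__ float tile[];   // sized to blockDim.x floats at launch
    auto block = cg::this_thread_block();

    // The whole block cooperatively issues one async copy. On cc 8.0+
    // this lowers to cp.async and bypasses per-thread register staging;
    // on older architectures it falls back to ordinary loads/stores.
    cg::memcpy_async(block, tile,
                     gmem + (size_t)blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);

    cg::wait(block);                  // block until the copy has landed
    out[(size_t)blockIdx.x * blockDim.x + threadIdx.x]
        = tile[threadIdx.x] * 2.0f;
}
```

Note that no thread ever names a register holding the copied data; the threads only describe the transfer and later synchronize on it.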
Thanks!
There are some higher-level benchmarks in the Dissecting Hopper paper.
Async copies seem to be around as fast as synchronous copies under normal, distributed usage.
The question is how many SMs have to participate to fully utilize the L2 bandwidth, and how many transfers have to be started per SM.
You could just try it out.
Does async copy need to be coalesced too?
You need 128-byte alignment for full performance.
I would not call it coalescing, as there are no threads involved.
I used to believe that when doing an async copy, each thread has its own data to load. That’s why I wondered whether there would be a coalescing requirement.
No, it is a different ‘engine’, which can only copy memory (global to shared).
The threads start the engine and it copies in the background.
Later, when you load the data from shared memory, you have threads again, with potential bank conflicts.
Thanks! I got it. So in this respect, async copy is a bit easier to use than directly loading.
It depends whether it is easier.
It is asynchronous, so you have to take care of synchronization or pipelining.
If the memory alignment is 128 bytes anyway, simple load operations over 32 threads, with or without storing into shared memory, are also quite trivial.
The slight advantages are:
- fewer instructions needed for the memory transfer (only if you wanted to store into shared memory anyway; otherwise you have to load from shared memory and need the instructions there)
- no threads are blocked for the copy (easier to hide latencies, but there are no more compute cores overall; the improvement is similar to raising the number of resident threads per SM, as limited by registers used per thread)
- perhaps easier than doing asynchronous transfers with dedicated copy warps instead (easier because of the available examples, but quite similar)
- the engine transfers continuously, so potentially better memory performance (threads can be interrupted mid-copy by other warps; the engine cannot)
- not sure whether the copy engine has additional performance optimizations, since it knows in advance that a large block is being transferred
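The latency-hiding pipeline hinted at above can be sketched with libcu++’s `cuda::pipeline`: prefetch tile k+1 while computing on tile k. The kernel name, tile size, and the assumption that `blockDim.x == TILE` are mine, not from the thread:

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: two-stage double buffer driven by a block-scoped pipeline.
__global__ void pipelined(const float* in, float* out, int ntiles) {
    constexpr int TILE = 256;   // assumes blockDim.x == TILE
    __shared__ float buf[2][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(block, buf[0], in, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int k = 0; k < ntiles; ++k) {
        if (k + 1 < ntiles) {   // issue the next tile's copy up front
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[(k + 1) % 2],
                               in + (size_t)(k + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();   // wait until tile k has landed in smem
        out[(size_t)k * TILE + threadIdx.x] = buf[k % 2][threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}
```

While the threads compute on one buffer, the previously issued copy fills the other, which is exactly the copy/compute overlap described above.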
I have a question about the alignment to 128B. As I can load at most 16B per instruction, how can I satisfy the alignment requirement?
Does the alignment mean that, at the same moment, the threads together ask the engine for 128B-aligned data when summed up?
Alignment only concerns the starting address, not the amount to be copied.
Also, the 128-byte alignment is a performance requirement, not a correctness one.
I would expect it to also work fast enough with 32-byte alignment, but please try it out.
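When you can guarantee the starting addresses, you can tell the API so via `cuda::aligned_size_t`, which lets the implementation pick the widest `cp.async` variant. A sketch (the kernel name, tile size, and computation are hypothetical; `cuda::aligned_size_t` is documented in libcu++, pulled in here via `<cuda/barrier>`):

```cuda
#include <cuda/barrier>                      // cuda::aligned_size_t
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: promise 16-byte alignment of both the source and destination
// starting addresses (and that the size is a multiple of 16).
__global__ void staged_load(const float4* gmem, float* out) {
    __shared__ alignas(16) float4 tile[64];  // float4 is naturally 16B-aligned

    auto block = cg::this_thread_block();
    cg::memcpy_async(block, tile, gmem,
                     cuda::aligned_size_t<16>(sizeof(tile)));
    cg::wait(block);

    if (threadIdx.x < 64) out[threadIdx.x] = tile[threadIdx.x].x;
}
```

If the promised alignment does not actually hold at runtime, the behavior is undefined, so only use this when the guarantee is real.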
However, if I need to load 64B of data to smem, as I can only load 16B per instruction, there is no way to align each instruction’s address to 128B.
I would not see asynchronous copies as the first choice for loading only 64 bytes.
I would use them for 128 bytes up to several KB per request.
But you can try it for your specific case.
Yes, I don’t load just 64B; that was just an example. But if I need to load 4KB of data, I still have to load it 16B by 16B. So there will often be cases where the addresses in shared memory and global memory are not aligned to 128B (or even 32B). If I load 16B at a time, I can only ensure the address is aligned to 16B.
I think the async copy engine is at least as strict (regarding alignment, size, etc.) as directly loading from global memory.