The difference between Asynchronous Copy and Synchronous Copy

andyCorrin · November 1, 2024, 1:21am

Hello there
According to https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#asynchronous-copy-from-global-memory-to-shared-memory, I think the difference between asynchronous copy and synchronous copy is that the asynchronous copy bypasses the register file (RF), and under some conditions maybe also L1.

However, is it correct to understand that another difference is that, with asynchronous copy, the threads themselves also behave asynchronously: a thread does not wait for the other threads, but fetches its data and then does its own computation? And with synchronous copy, the threads have to wait for the slowest thread to finish fetching data before they all do the computation together?

How can I tell whether the asynchronous copy is used on an A100 when I just call the GEMM function from the cuBLAS library?

I would really appreciate it if you could help.

Curefab · November 1, 2024, 10:31am

The asynchronous copy is similar to a DMA controller:
The data is copied without explicit instructions for each load operation.
And it is copied directly into shared memory instead of into the register file.

The “synchronous copy” is also asynchronous insofar as the threads continue with the following instructions until they reach the instruction that uses the register in which the loaded values should appear. Only then is the warp no longer eligible for scheduling. For global memory accesses, those pending loads are tracked per register by the long scoreboard. You can find that as a possible stall reason in Nsight Compute.

In Asynchronous Copy the threads do not directly wait, as the result goes into shared memory. They have to explicitly wait for the asynchronous copy to finish.
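
To make the two copy styles described above concrete, here is a minimal sketch, not taken from any NVIDIA sample. The kernel names, `TILE`, and the host helper `compare_sync_and_async` are hypothetical and only for illustration. It assumes CUDA 11 or newer and a build such as `nvcc -arch=sm_80`; on older architectures `cuda::memcpy_async` should still work but may fall back to an ordinary copy.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per block; kernels assume blockDim.x == TILE

// "Synchronous" copy: every thread issues its own load into a register and
// then stores that register to shared memory.
__global__ void scale_sync(const float* in, float* out) {
    __shared__ float tile[TILE];
    const int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];   // global -> register -> shared
    __syncthreads();             // make the whole tile visible to the block
    out[i] = 2.0f * tile[(threadIdx.x + 1) % TILE];  // reads a neighbour's element
}

// Asynchronous copy: the block submits the copy and later waits explicitly.
__global__ void scale_async(const float* in, float* out) {
    __shared__ float tile[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    cuda::memcpy_async(block, tile, in + blockIdx.x * TILE, sizeof(float) * TILE, bar);
    // Independent work that does not touch `tile` could run here.
    bar.arrive_and_wait();       // explicit wait for the copy to finish

    const int i = blockIdx.x * TILE + threadIdx.x;
    out[i] = 2.0f * tile[(threadIdx.x + 1) % TILE];
}

// Hypothetical host helper: runs both kernels on the same input and returns
// the number of differing outputs, or -1 if a CUDA call fails.
int compare_sync_and_async() {
    const int n = 4 * TILE;
    float *in = nullptr, *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) in[i] = float(i);
    scale_sync<<<n / TILE, TILE>>>(in, a);
    scale_async<<<n / TILE, TILE>>>(in, b);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (a[i] != b[i]);
    cudaFree(in); cudaFree(a); cudaFree(b);
    return mismatches;
}
```

Both kernels compute the same result. The difference is only in how the tile reaches shared memory and where the waiting happens: implicitly at the first use of the loaded register in `scale_sync`, explicitly at `bar.arrive_and_wait()` in `scale_async`.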

You can use Nsight Compute to see which kind of copies are used, even with library kernels. On Ampere, asynchronous copies show up as LDGSTS instructions in the SASS view.

andyCorrin · November 4, 2024, 8:32am

Thanks for your explanation. But I am still confused: it seems to me that memcpy_async should be used with a pipeline so that the copy latency can be overlapped with computation. If I just use cuda::memcpy_async with a barrier, isn't it the same as a synchronous copy?

Curefab · November 4, 2024, 12:27pm

The difference is that a synchronous copy is actually executed by the warp(s) and therefore needs scheduling resources, whereas a barrier blocks the warp(s) and lets other warps use the computation time for something useful.

andyCorrin · November 5, 2024, 12:07pm

Thanks, but I am not sure I understand correctly. So can I say that for synchronous copy, the threads in a warp (or even all threads in a block) have to be synchronized before the computation, whereas asynchronous copy works at warp level: one warp can load data while other warps do computation in the meantime (using a barrier?)?

Curefab · November 5, 2024, 12:15pm

The synchronous copy (at least as I understand your use of the term) means that some threads (one warp or several warps) use their computation time to manually copy data (e.g. from global memory to registers or to shared memory).
The other warps can do something else in the meantime, but still some resources for scheduling are used and the copying warps themselves are busy and cannot do computations at that time.

With memcpy_async, the copy is done by a special engine in the background. So all warps can do computational tasks in the meantime. The copy also runs more smoothly, as the engine is dedicated to copying, whereas in the previous case the execution switches between the warps and the copying warps are only intermittently active.

Whether and when you need synchronization depends on which thread needs the data and on whether you are double buffering, using a ring buffer, or just have a single buffer.
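
As an illustration of the double-buffering case, here is a hedged sketch that follows the `cuda::pipeline` pattern from the CUDA C++ Programming Guide. The names `double_buffered`, `TILE`, `STAGES`, and the host helper `run_double_buffered` are made up for this example; the data layout (each block owns `batches` consecutive tiles) is an assumption as well.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;   // elements per batch; launch with blockDim.x == TILE
constexpr int STAGES = 2;   // two shared-memory buffers = double buffering

// Each block walks over `batches` consecutive tiles and doubles every element.
__global__ void double_buffered(const float* in, float* out, int batches) {
    __shared__ float buf[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
    auto block = cg::this_thread_block();
    auto pipe = cuda::make_pipeline(block, &state);

    const size_t base = size_t(blockIdx.x) * batches * TILE;
    int fetch = 0;
    for (int compute = 0; compute < batches; ++compute) {
        // Keep up to STAGES copies in flight ahead of the batch being computed.
        for (; fetch < batches && fetch < compute + STAGES; ++fetch) {
            pipe.producer_acquire();   // wait until this buffer is free again
            cuda::memcpy_async(block, buf[fetch % STAGES],
                               in + base + size_t(fetch) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();          // oldest in-flight batch has arrived
        out[base + size_t(compute) * TILE + threadIdx.x] =
            2.0f * buf[compute % STAGES][threadIdx.x];
        pipe.consumer_release();       // buffer may now be refilled
    }
}

// Hypothetical host helper: returns the number of wrong outputs,
// or -1 if a CUDA call fails.
int run_double_buffered() {
    const int blocks = 2, batches = 4, n = blocks * batches * TILE;
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) in[i] = float(i);
    double_buffered<<<blocks, TILE>>>(in, out, batches);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (out[i] != 2.0f * in[i]);
    cudaFree(in); cudaFree(out);
    return mismatches;
}
```

With a single buffer, the copy for the next batch could only start after the current batch has been consumed. With two stages, the copy of batch `compute + 1` is already in flight while batch `compute` is processed, and the acquire/release calls are the only synchronization points.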

andyCorrin · November 5, 2024, 12:23pm

"With memcpy_async, the copy is done by a special engine in the background."

Does that mean the copy does not consume any thread resources at all?
https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
This link gives an asynchronous copy example. In Figure 3, it seems that for each batch, a thread fetches data and then performs computation?

Curefab · November 5, 2024, 12:37pm

All this synchronizing of threads described in this paragraph (for the way it was done before asynchronous copy) is just one way of processing data and dividing the work between warps.

With asynchronous copy, threads have to initiate the copy, and before processing the data, each thread has to wait until (or at least make sure that) the data it needs has already been copied.

The copy itself does not consume the thread's resources. More precisely, what is not used are the resources of the SM partition, which switches between warps (the global and shared memory bandwidth is still used, of course). But yes, the thread itself can just continue to compute other things (or the previous iteration of the data).

Asynchronous copy is as if you had an additional warp dedicated to copying from global to shared memory, to which the other warps can send work packages (for copying data). And this additional warp is active all the time instead of taking time slices from the other warps.

This last paragraph is just for understanding/illustration. But before Ampere (before asynchronous copy) it was possible to do just that with an actual warp - dedicate a warp for copying and the other warps for computation.
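
A rough sketch of that older approach, with one warp only copying and the remaining warps only computing, could look as follows. This is an illustrative assumption of how such a kernel might be laid out, not code from a library; `warp_specialized`, `TILE`, `COMPUTE_THREADS`, and `run_warp_specialized` are hypothetical names.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 128;             // elements per batch
constexpr int COMPUTE_THREADS = 128;  // threads 0..127 compute, one per element
constexpr int COPY_THREADS = 32;      // threads 128..159 form the copy warp

// Launch with blockDim.x == COMPUTE_THREADS + COPY_THREADS.
__global__ void warp_specialized(const float* in, float* out, int batches) {
    __shared__ float buf[2][TILE];
    const bool is_copy_warp = threadIdx.x >= COMPUTE_THREADS;
    const int lane = threadIdx.x - COMPUTE_THREADS;  // only valid in the copy warp
    const size_t base = size_t(blockIdx.x) * batches * TILE;

    // Prologue: the copy warp fills the first buffer.
    if (is_copy_warp)
        for (int k = lane; k < TILE; k += COPY_THREADS) buf[0][k] = in[base + k];
    __syncthreads();

    for (int b = 0; b < batches; ++b) {
        if (is_copy_warp) {
            // Fetch the next batch into the buffer that is not being read.
            if (b + 1 < batches)
                for (int k = lane; k < TILE; k += COPY_THREADS)
                    buf[(b + 1) % 2][k] = in[base + size_t(b + 1) * TILE + k];
        } else {
            // Compute warps work on the batch that is already in shared memory.
            out[base + size_t(b) * TILE + threadIdx.x] = 2.0f * buf[b % 2][threadIdx.x];
        }
        __syncthreads();  // batch b consumed and batch b + 1 filled
    }
}

// Hypothetical host helper: returns the number of wrong outputs,
// or -1 if a CUDA call fails.
int run_warp_specialized() {
    const int blocks = 2, batches = 4, n = blocks * batches * TILE;
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) in[i] = float(i);
    warp_specialized<<<blocks, COMPUTE_THREADS + COPY_THREADS>>>(in, out, batches);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (out[i] != 2.0f * in[i]);
    cudaFree(in); cudaFree(out);
    return mismatches;
}
```

Here the copy warp still occupies scheduler slots and registers, which is exactly the cost that the hardware asynchronous copy avoids.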
