However, is it correct to understand that another difference is that, in an asynchronous copy, the threads also behave asynchronously: they do not wait for each other, and each thread fetches its data and then does its own computation? And that in a synchronous copy, the threads have to wait for the slowest thread to finish fetching data before they do the computation together?
How can I tell whether asynchronous copy is used on an A100 when I just call a GEMM function from the cuBLAS library?
The asynchronous copy is similar to a DMA controller:
The data is copied without explicit instructions for each load operation.
And it is copied directly into shared memory instead of into the register file.
The “Synchronous Copy” is also asynchronous insofar as the threads continue with the following instructions until they reach the instruction that uses the register in which the loaded value should appear. Only then is the warp no longer eligible for scheduling. For global memory accesses, these pending loads are tracked per register by the long scoreboard. You can find that as a possible stall reason in Nsight Compute.
With an asynchronous copy, the threads do not wait implicitly, because the result goes into shared memory rather than into a register. They have to wait explicitly for the asynchronous copy to finish.
You can use Nsight Compute to see which kind of copy is used, even with library kernels. In the SASS view of a kernel, asynchronous global-to-shared copies show up as LDGSTS instructions.
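To make the two copy styles concrete, here is a minimal sketch, not taken from cuBLAS or any other library. The kernel names, `TILE`, and the host helper `copies_agree` are made up for illustration. The first kernel stages data through registers; the second uses `cooperative_groups::memcpy_async` and waits explicitly with `cooperative_groups::wait`. It assumes the block size equals `TILE`.

```cuda
#include <cassert>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per block; also the block size (assumption)

// "Synchronous" staging: each thread loads into a register, then stores
// that register to shared memory. The warp stalls (long scoreboard) at the
// first instruction that needs the loaded register.
__global__ void scale_sync(const float* in, float* out, float factor) {
    __shared__ float tile[TILE];
    const int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];   // global -> register -> shared
    __syncthreads();             // make the tile visible to the whole block
    out[i] = tile[threadIdx.x] * factor;
}

// Asynchronous staging: the block only initiates the copy; the data goes
// to shared memory without passing through registers (on Ampere or newer).
__global__ void scale_async(const float* in, float* out, float factor) {
    __shared__ __align__(16) float tile[TILE];
    cg::thread_block block = cg::this_thread_block();
    cg::memcpy_async(block, tile, in + blockIdx.x * TILE, sizeof(float) * TILE);
    // Work that does not depend on `tile` could be placed here.
    cg::wait(block);             // explicit wait: copy finished, block synchronized
    const int i = blockIdx.x * TILE + threadIdx.x;
    out[i] = tile[threadIdx.x] * factor;
}

// Hypothetical host helper: runs both kernels and checks that they agree.
bool copies_agree(int n_blocks = 4) {
    assert(n_blocks > 0);
    const int n = n_blocks * TILE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_sync(n), h_async(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_sync = nullptr, *d_async = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_sync, bytes);
    cudaMalloc(&d_async, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_sync<<<n_blocks, TILE>>>(d_in, d_sync, 2.0f);
    scale_async<<<n_blocks, TILE>>>(d_in, d_async, 2.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_sync.data(), d_sync, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_async.data(), d_async, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        ok = ok && h_sync[i] == 2.0f * h_in[i] && h_async[i] == h_sync[i];

    cudaFree(d_in);
    cudaFree(d_sync);
    cudaFree(d_async);
    return ok;
}
```

Calling `copies_agree()` from a `main` function and compiling with `nvcc -arch=sm_80` should exercise both paths. On GPUs older than compute capability 8.0 the second kernel still works, but the copy falls back to ordinary loads and stores. With a single buffer and an immediate wait, as here, there is nothing to overlap, so the two kernels mainly differ in how the data reaches shared memory.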
Thanks for your explanation. But I am still confused: it seems to me that memcpy_async should be used with a pipeline so that the copy latency can be overlapped with computation. If I just use cuda::memcpy_async with a barrier, doesn't it behave the same as a synchronous copy?
The difference is that a synchronous copy is actually executed by the warp(s), which takes up scheduling resources, whereas waiting at a barrier blocks the warp(s) and lets other warps use the computation time for something useful.
Thanks, but I am not sure I understand correctly. Can I say that with a synchronous copy, the threads in a warp (or even all threads in a block) have to be synchronized before the computation, whereas an asynchronous copy works at warp level: one warp can load data while, in the meantime, other warps do computation (using a barrier?)?
The synchronous copy (at least as I understand the term as you have used it) means that some threads (one warp or several warps) use their computation time to manually copy data (e.g. from global memory to registers or to shared memory).
The other warps can do something else in the meantime, but still some resources for scheduling are used and the copying warps themselves are busy and cannot do computations at that time.
With memcpy_async, the copy is done by a special engine in the background, so all warps can do computational tasks in the meantime. The copy also runs more smoothly, because the engine is dedicated to copying, whereas in the previous case execution switches between the warps and the copying warps are only intermittently active.
Whether and when you need synchronization depends on which thread needs the data, and on whether you are double buffering, using a ring buffer, or just have a single buffer.
All this synchronizing of threads described in this paragraph (for the way it was done before asynchronous copy) is just one way of processing data and dividing the work between warps.
With asynchronous copy, threads have to initiate the copy, and before the data is processed, they have to wait until (or at least make sure that) the data they need has already been copied.
The copy itself does not consume the thread's resources. More precisely, it is the resources of the SM partition, which switches between warps, that are not used (except for global and shared memory bandwidth, of course). But, yes, the thread itself can just continue to compute other things (or the previous iteration of the data).
Asynchronous copy is as if you had an additional warp, dedicated to copying from global to shared memory, to which the other warps can send work packages (for copying data). And this additional warp is active all the time instead of taking time slices from the other warps.
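As an illustration of computing on the previous tile while the next one is still in flight, here is a hedged double-buffering sketch. It uses `cooperative_groups::memcpy_async` with `cooperative_groups::wait_prior<1>` instead of `cuda::pipeline`, only to keep the example short. The kernel name, `TILE`, the layout (each block owns `n_tiles` consecutive tiles), and the helper `double_buffer_ok` are assumptions made for this example.

```cuda
#include <cassert>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per tile; also the block size (assumption)

// Each block processes n_tiles consecutive tiles using two shared buffers:
// while tile t is being computed, tile t+1 is already being copied.
__global__ void scale_double_buffered(const float* in, float* out,
                                      int n_tiles, float factor) {
    __shared__ __align__(16) float buf[2][TILE];
    if (n_tiles <= 0) return;  // uniform across the block

    cg::thread_block block = cg::this_thread_block();
    const size_t bytes = sizeof(float) * TILE;
    const float* src = in + static_cast<size_t>(blockIdx.x) * n_tiles * TILE;
    float* dst = out + static_cast<size_t>(blockIdx.x) * n_tiles * TILE;

    // Prologue: start fetching tile 0.
    cg::memcpy_async(block, buf[0], src, bytes);

    for (int t = 0; t < n_tiles; ++t) {
        if (t + 1 < n_tiles) {
            // Start fetching the next tile into the other buffer ...
            cg::memcpy_async(block, buf[(t + 1) % 2], src + (t + 1) * TILE, bytes);
            // ... and wait only for the older copy (tile t) to complete.
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);  // last tile: wait for everything outstanding
        }
        dst[t * TILE + threadIdx.x] = buf[t % 2][threadIdx.x] * factor;
        block.sync();  // all reads of buf[t % 2] are done before it is refilled
    }
}

// Hypothetical host helper: launches the kernel and verifies the result.
bool double_buffer_ok(int n_blocks = 2, int n_tiles = 5) {
    assert(n_blocks > 0 && n_tiles > 0);
    const int n = n_blocks * n_tiles * TILE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_double_buffered<<<n_blocks, TILE>>>(d_in, d_out, n_tiles, 3.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && h_out[i] == 3.0f * h_in[i];

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

The `block.sync()` at the end of each iteration is what makes it safe to reuse a buffer two iterations later. With a ring buffer of more stages, the same idea applies with a larger argument to `wait_prior`.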
This last paragraph is just for understanding/illustration. But before Ampere (before asynchronous copy) it was possible to do just that with an actual warp - dedicate a warp for copying and the other warps for computation.
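The pre-Ampere approach of dedicating an actual warp to copying can be sketched roughly as follows. This is only an illustration of the idea, not code from any library. The kernel name, the split into 128 compute threads plus one copy warp, and the helper `copy_warp_ok` are assumptions for the example.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 128;          // elements per tile = number of compute threads
constexpr int COPY_THREADS = 32;   // one extra warp that only copies
constexpr int BLOCK_THREADS = TILE + COPY_THREADS;

// Threads 0..127 compute on tile t; threads 128..159 (one full warp) copy
// tile t+1 from global to shared memory through their registers.
__global__ void scale_copy_warp(const float* in, float* out,
                                int n_tiles, float factor) {
    __shared__ float buf[2][TILE];
    const bool is_copier = threadIdx.x >= TILE;
    const int lane = threadIdx.x - TILE;  // 0..31 inside the copy warp
    const float* src = in + static_cast<size_t>(blockIdx.x) * n_tiles * TILE;
    float* dst = out + static_cast<size_t>(blockIdx.x) * n_tiles * TILE;

    // Prologue: the copy warp loads tile 0 while the compute warps wait.
    if (is_copier && n_tiles > 0)
        for (int j = lane; j < TILE; j += COPY_THREADS) buf[0][j] = src[j];
    __syncthreads();

    for (int t = 0; t < n_tiles; ++t) {
        if (is_copier) {
            if (t + 1 < n_tiles)
                for (int j = lane; j < TILE; j += COPY_THREADS)
                    buf[(t + 1) % 2][j] = src[(t + 1) * TILE + j];
        } else {
            dst[t * TILE + threadIdx.x] = buf[t % 2][threadIdx.x] * factor;
        }
        // After this barrier tile t+1 is complete and tile t may be overwritten.
        __syncthreads();
    }
}

// Hypothetical host helper: launches the kernel and verifies the result.
bool copy_warp_ok(int n_blocks = 2, int n_tiles = 4) {
    assert(n_blocks > 0 && n_tiles > 0);
    const int n = n_blocks * n_tiles * TILE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_copy_warp<<<n_blocks, BLOCK_THREADS>>>(d_in, d_out, n_tiles, 2.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && h_out[i] == 2.0f * h_in[i];

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Unlike the asynchronous copy, the copy warp here still occupies scheduler slots and registers, which is exactly the cost described above.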