Will async copy be slower than directly loading data from gmem through registers to smem? How many cycles should I assume for an async copy of 16B per thread directly from gmem, from L2 cache, and from L1 cache? Also, does the async copy operation need to be coalesced?
Can I just regard async copy as the same as a direct copy from gmem through registers to shared memory?
async copy, at least for cc 8.0+, should not involve register usage/staging for the copy. It is a mechanism that can copy “directly” from gmem to smem. I have no idea about faster or slower, and since it is async, the number of cycles it might require would almost certainly depend on what else is going on on the GPU or SM, so asking for a cycle count seems impossible to answer (to me).
gmem activity always flows through L2. L2 is a device-wide proxy for gmem, at least from the standpoint of activity emanating from the SMs.
I notice that in the A100 whitepaper, the speed of loading data from gmem through registers to shared memory is given, but the speed of async copy is not. So I just wonder if it is totally unknown and unpredictable?
I don’t know of any specification or documentation for it, and I’ve already indicated why I think it could be variable. I don’t know the bounds of variability.
The GPU is a latency-hiding machine. A principal idea of async copy is that you are giving the GPU something else to do, in the presence of other issued work. So its purpose is (probably - my guess) not to offer the fastest possible path to move data in any scenario, but instead to provide an asynchronous path that allows you to efficiently schedule a copy operation, making efficient use of GPU capacity.
It allows one to create an asynchronous work pipeline, not unlike the overlap of copy and compute that is considered a fundamental CUDA optimization principle, when copying data to/from the GPU.
Another possible value proposition for async copy - that I mentioned already - is lower register usage/pressure.
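As a concrete illustration of that “no register staging” path, here is a minimal sketch using the cooperative-groups `memcpy_async` API (the kernel name, the dynamic shared-memory tile, and the trivial doubling computation are my own assumptions, not from the thread):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: stage one tile of floats into shared memory, then use it.
__global__ void scale_tile(const float* gmem, float* out) {
    extern __shared__ float tile[];   // sized to blockDim.x floats at launch
    auto block = cg::this_thread_block();

    // The whole block cooperatively issues one async copy. On cc 8.0+
    // this lowers to cp.async and bypasses per-thread register staging;
    // on older architectures it falls back to ordinary loads/stores.
    cg::memcpy_async(block, tile,
                     gmem + (size_t)blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x);

    cg::wait(block);                  // block until the copy has landed
    out[(size_t)blockIdx.x * blockDim.x + threadIdx.x]
        = tile[threadIdx.x] * 2.0f;
}
```

Note that no thread ever names a register holding the copied data; the threads only describe the transfer and later synchronize on it.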
Thanks!
There are some higher-level benchmarks in the Dissecting Hopper paper.
Async copies seem to be around as fast as synchronous copies under normal, distributed usage.
The question is how many SMs have to participate to fully utilize the L2 bandwidth, and how many transfers have to be started per SM.
You could just try it out.
Does async copy need to be coalesced too?
You need 128-byte alignment for full performance.
I would not call it coalescing, as there are no threads involved.
I used to believe that when doing an async copy, each thread has its own data to load. That’s why I wondered whether there would be a coalescing requirement.
No, it is a different ‘engine’, which can only copy memory (global to shared).
The threads start the engine and it copies in the background.
Later, when you load the data from shared memory, you have threads again, with potential bank conflicts.
Thanks! I got it. So in this respect, async copy is a bit easier to use than directly loading.
It depends whether it is easier.
It is asynchronous, so you have to take care of synchronization or pipelining.
If the memory alignment is 128 bytes anyway, simple load operations over 32 threads, with or without storing into shared memory, are also quite trivial.
The slight advantages are:
- fewer instructions needed for the memory transfer (only if you wanted to store into shared memory anyway; otherwise you have to load from shared memory and need the instructions there)
- no threads are blocked for the copy (easier to hide latencies, but there are no more compute cores overall; the improvement is similar to raising the number of resident threads per SM, as limited by registers used per thread)
- perhaps easier than doing asynchronous transfers with dedicated copy warps instead (easier because of the available examples, but quite similar)
- the engine transfers continuously, so potentially better memory performance (threads can be interrupted mid-copy by other warps; the engine cannot)
- not sure whether the copy engine has additional performance optimizations, since it knows in advance that a large block is being transferred
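The latency-hiding pipeline hinted at above can be sketched with libcu++’s `cuda::pipeline`: prefetch tile k+1 while computing on tile k. The kernel name, tile size, and the assumption that `blockDim.x == TILE` are mine, not from the thread:

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: two-stage double buffer driven by a block-scoped pipeline.
__global__ void pipelined(const float* in, float* out, int ntiles) {
    constexpr int TILE = 256;   // assumes blockDim.x == TILE
    __shared__ float buf[2][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prime the pipeline with the first tile.
    pipe.producer_acquire();
    cuda::memcpy_async(block, buf[0], in, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int k = 0; k < ntiles; ++k) {
        if (k + 1 < ntiles) {   // issue the next tile's copy up front
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[(k + 1) % 2],
                               in + (size_t)(k + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();   // wait until tile k has landed in smem
        out[(size_t)k * TILE + threadIdx.x] = buf[k % 2][threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}
```

While the threads compute on one buffer, the previously issued copy fills the other, which is exactly the copy/compute overlap described above.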
I have a question about the alignment to 128B. As I can load at most 16B per instruction, how can I satisfy the alignment requirement?
Does the alignment mean that, at the same moment, the threads together ask the engine for 128B-aligned data when summed up?
Alignment only concerns the starting address, not the amount to be copied.
Also, the 128-byte alignment is a performance requirement, not a correctness one.
I would expect it to also work fast enough with 32-byte alignment, but please try it out.
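When you can guarantee the starting addresses, you can tell the API so via `cuda::aligned_size_t`, which lets the implementation pick the widest `cp.async` variant. A sketch (the kernel name, tile size, and computation are hypothetical; `cuda::aligned_size_t` is documented in libcu++, pulled in here via `<cuda/barrier>`):

```cuda
#include <cuda/barrier>                      // cuda::aligned_size_t
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: promise 16-byte alignment of both the source and destination
// starting addresses (and that the size is a multiple of 16).
__global__ void staged_load(const float4* gmem, float* out) {
    __shared__ alignas(16) float4 tile[64];  // float4 is naturally 16B-aligned

    auto block = cg::this_thread_block();
    cg::memcpy_async(block, tile, gmem,
                     cuda::aligned_size_t<16>(sizeof(tile)));
    cg::wait(block);

    if (threadIdx.x < 64) out[threadIdx.x] = tile[threadIdx.x].x;
}
```

If the promised alignment does not actually hold at runtime, the behavior is undefined, so only use this when the guarantee is real.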
However, if I need to load 64B of data to smem, as I can only load 16B per instruction, there is no way to align each instruction’s address to 128B.
I would not see asynchronous copies as the first choice for loading only 64 bytes.
I would use them for 128 bytes up to several KB per request.
But you can try it for your specific case.
Yes, I don’t load just 64B; that was just an example. But if I need to load 4KB of data, I still have to load it 16B by 16B. So there will often be cases where the addresses in shared memory and global memory are not aligned to 128B (or even 32B). If I load 16B at a time, I can only ensure the address is aligned to 16B.
I think the async copy engine is at least as strict (regarding alignment, size, etc.) as directly loading from global memory.