Hi, I noticed that cp.async can copy data from global memory to shared memory on architectures before Ada Lovelace. Can I do something like cp.async from global memory to registers on Hopper or Ada?
I would really appreciate any help!
No, cp.async does not load directly into registers. However, ordinary load (memory access) instructions are also asynchronous, and they transfer data from global memory to registers.
Thanks for your reply! I assumed that each thread can only do one thing at a time, so while it is loading it cannot do anything else. How should I understand the asynchronous mechanism when loading from global memory to registers? Does it mean that I can overlap some computation with the load? And how can I design an experiment to verify it?
Threads only initiate computations, loads, or stores. A thread has to wait only if a) it uses the result, e.g. as input to another computation, or b) the pipeline is full, e.g. too many stores are in flight.
The hardware keeps an internal list (the scoreboard) of registers that are still waiting to be filled with the results of previous instructions.
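A minimal sketch of what this means in a kernel. The names overlap_kernel and run_overlap_demo are made up for illustration, and the compiler is free to reorder instructions, so the source order here only suggests where a stall could occur:

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the load is issued first, independent math follows,
// and the loaded value is consumed only at the very end.
__global__ void overlap_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];  // load is issued; the thread need not stall here

    // Arithmetic that does not read v, so it can run while the load is in flight.
    float acc = static_cast<float>(i);
    for (int k = 0; k < 16; ++k) acc = acc * 0.5f + 1.0f;

    out[i] = v + acc;  // first use of v: the thread waits here if v has not arrived
}

// Hypothetical host helper: runs the kernel and checks the results on the CPU.
bool run_overlap_demo(int n) {
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    overlap_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // Recompute the same expression on the host and compare.
    for (int i = 0; i < n; ++i) {
        float acc = static_cast<float>(i);
        for (int k = 0; k < 16; ++k) acc = acc * 0.5f + 1.0f;
        if (std::fabs(h_out[i] - (1.0f + acc)) > 1e-4f) return false;
    }
    return true;
}

int main() {
    std::printf("results %s\n", run_overlap_demo(1 << 16) ? "match" : "do not match");
    return 0;
}
```

The numeric output is identical whether or not the load overlaps with the arithmetic, which is why correctness checks alone cannot demonstrate the overlap.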
There is no very simple experiment to verify it, because the numeric results are the same either way. You can only see it as higher execution speed when profiling. For that you could use the clock functions (see the CUDA C Programming Guide) or the Nsight Compute profiler to show that the GPU actually waits at the arithmetic instruction that uses the loaded value rather than at the instruction that loads it.
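One way the clock-based experiment could look, as a rough sketch only. load_timing_kernel and run_timing_probe are hypothetical names, and I am assuming the compiler keeps the clock64() reads roughly where they appear in the source. That is not guaranteed, so it is worth checking the generated SASS (e.g. with cuobjdump -sass) before trusting the numbers:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel, meant to be launched with a single thread.
__global__ void load_timing_kernel(const float* in, float* out, long long* cycles) {
    long long t0 = clock64();
    float v = in[0];              // issue the global load
    long long t1 = clock64();     // ideally read right after the load was issued
    float r = v * 2.0f + 1.0f;    // first use of v
    out[0] = r;
    long long t2 = clock64();     // ideally read after the dependent work was issued
    cycles[0] = t1 - t0;          // cycles around issuing the load
    cycles[1] = t2 - t1;          // cycles around the first use of the loaded value
}

// Hypothetical host helper: returns true on success and reports both intervals.
bool run_timing_probe(long long* issue_cycles, long long* use_cycles) {
    float h_in = 1.0f, h_out = 0.0f;
    long long h_cycles[2] = {0, 0};
    float *d_in = nullptr, *d_out = nullptr;
    long long* d_cycles = nullptr;
    if (cudaMalloc(&d_in, sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_cycles, 2 * sizeof(long long)) != cudaSuccess) return false;
    cudaMemcpy(d_in, &h_in, sizeof(float), cudaMemcpyHostToDevice);

    load_timing_kernel<<<1, 1>>>(d_in, d_out, d_cycles);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_cycles);
    if (err != cudaSuccess) return false;

    *issue_cycles = h_cycles[0];
    *use_cycles = h_cycles[1];
    return h_out == 3.0f;  // 1.0f * 2.0f + 1.0f
}

int main() {
    long long issue = 0, use = 0;
    if (!run_timing_probe(&issue, &use)) {
        std::printf("probe failed\n");
        return 1;
    }
    std::printf("around load issue: %lld cycles, around first use: %lld cycles\n", issue, use);
    return 0;
}
```

If the load really is asynchronous, I would expect the first interval to be short and the second to absorb most of the memory latency, but the actual numbers depend on the GPU, the cache state, and how the compiler schedules the instructions. The stall reasons in Nsight Compute's source view are the more reliable evidence.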
Thank you so much! It really helps a lot. I will try to figure it out using the profiler.
BTW, just as a remark: if you transfer a lot of data, cp.async can still help, because
- it is more difficult to load asynchronously into registers from high-latency global memory than from low-latency shared memory
- the number of registers and the number of in-flight transactions are limited
- the memory accesses to global memory are coalesced
Depending on your kernel and your access profile, cp.async could also lead to no improvement or even a slow-down, especially if your kernel needs the data and the processors have nothing else to do until the data has arrived.
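For reference, a minimal sketch of staging data through shared memory with the pipeline primitives from cuda_pipeline.h, which as far as I know compile to cp.async when building for sm_80 or newer. The kernel name reverse_tiles, the helper run_reverse_tiles_demo, and the tile size are made up for illustration, and the input length is assumed to be a multiple of the tile size:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>

constexpr int TILE = 256;  // hypothetical tile size, also used as the block size

// Hypothetical kernel: each block copies one tile to shared memory
// asynchronously, then writes it back in reversed order.
__global__ void reverse_tiles(const float* in, float* out) {
    __shared__ float tile[TILE];
    int base = blockIdx.x * TILE;
    int t = threadIdx.x;

    // Start the global-to-shared copy without going through a register.
    __pipeline_memcpy_async(&tile[t], &in[base + t], sizeof(float));
    __pipeline_commit();

    // Work that does not depend on the tile could be placed here.

    __pipeline_wait_prior(0);  // wait for this thread's committed copies
    __syncthreads();           // make every thread's copy visible to the block

    out[base + t] = tile[TILE - 1 - t];
}

// Hypothetical host helper: runs the kernel and verifies the reversal.
bool run_reverse_tiles_demo(int num_tiles) {
    int n = num_tiles * TILE;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    reverse_tiles<<<num_tiles, TILE>>>(d_in, d_out);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < num_tiles; ++b)
        for (int t = 0; t < TILE; ++t)
            if (h_out[b * TILE + t] != h_in[b * TILE + TILE - 1 - t]) return false;
    return true;
}

int main() {
    std::printf("tiles %s\n", run_reverse_tiles_demo(4) ? "reversed correctly" : "mismatch");
    return 0;
}
```

In this toy kernel there is nothing to overlap with the copy, so it is exactly the case where cp.async would not be expected to help. It only pays off when there is independent work between the commit and the wait, or when several tiles are kept in flight.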
I got it! Thank you so much again~