CUDA PTX cp.async only supports global to shared memory copy

The description explicitly says “Operand src specifies a location in the global state space and dst specifies a location in the shared state space.”

Is there an instruction that does the reverse (shared memory to global async copy), and if not, why is that the case?

I see there is a cp.async.bulk operation available since sm_90, and it seems to support more generic src and dst memory types. Is that right? (I don’t have a Hopper GPU, so I cannot verify.)

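sm_80 and newer provide hardware acceleration for copies from global memory to shared memory via cp.async.
The reverse direction is not hardware accelerated by cp.async, so that instruction has no shared-to-global form. On sm_90, cp.async.bulk does accept shared memory as the source and global memory as the destination.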
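
To illustrate the asymmetry, here is a minimal sketch, assuming an sm_80 or newer GPU and compilation with something like `nvcc -arch=sm_80`. The kernel name `double_via_shared` and the sizes are hypothetical, chosen only for this example. The load uses cp.async through inline PTX (global to shared); the write back to global memory is an ordinary store, since cp.async has no form for that direction.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size, arbitrary choice for this sketch

// Hypothetical kernel: stages one float per thread in shared memory with
// cp.async, then writes 2*x back to global memory with a plain store.
__global__ void double_via_shared(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        // dst must be a shared state space address, src a global one.
        unsigned dst = static_cast<unsigned>(
            __cvta_generic_to_shared(&tile[threadIdx.x]));
        size_t src = __cvta_generic_to_global(in + i);
        // Asynchronous 4-byte copy, global -> shared.
        asm volatile("cp.async.ca.shared.global [%0], [%1], 4;"
                     :: "r"(dst), "l"(src) : "memory");
    }
    // Close the group of outstanding copies and wait for all of them.
    asm volatile("cp.async.commit_group;" ::: "memory");
    asm volatile("cp.async.wait_group 0;" ::: "memory");
    // Only required when threads read elements copied by other threads;
    // kept here because that is the common case.
    __syncthreads();

    // Shared -> global: a regular synchronous store.
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1000;
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + kThreads - 1) / kThreads;
    double_via_shared<<<blocks, kThreads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Count elements that differ from the expected 2*x.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

The 4-byte copy size is used here for simplicity; cp.async.ca also accepts 8 and 16 bytes, and wider copies are usually preferable when the data is suitably aligned.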
