The description explicitly says: “Operand src specifies a location in the global state space and dst specifies a location in the shared state space.”
Is there an instruction that does the reverse (shared memory to global async copy), and if not, why is that the case?
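For concreteness, here is a minimal sketch of the two directions as I understand them, assuming an sm_80 or newer GPU and the `__pipeline_memcpy_async` primitive from `<cuda_pipeline.h>`, which as far as I know is the CUDA C++ wrapper for this kind of copy. The kernel name `roundtrip` and the sizes are made up for illustration. The global-to-shared copy is asynchronous, while the shared-to-global write is just an ordinary store:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior

// Hypothetical block size, chosen only for this illustration.
constexpr int kThreads = 128;

__global__ void roundtrip(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Global -> shared: asynchronous copy (dst must be shared, src must be global).
    if (i < n) {
        __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float));
    }
    __pipeline_commit();       // close the current batch of async copies
    __pipeline_wait_prior(0);  // wait until all committed batches have landed
    __syncthreads();

    // Shared -> global: plain synchronous store, no async variant used here.
    if (i < n) {
        out[i] = tile[threadIdx.x];
    }
}

int main() {
    const int n = 1024;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    roundtrip<<<(n + kThreads - 1) / kThreads, kThreads>>>(d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify that every element made the round trip unchanged.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

I would build this with something like `nvcc -arch=sm_80 roundtrip.cu`. The asymmetry between the two copies in the kernel is exactly what I am asking about.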
I see there is a cp.async.bulk operation available since sm_90 that seems to support more general src and dst memory types. Is that right? (I don’t have a Hopper GPU, so I cannot verify.)