The description explicitly says “Operand src specifies a location in the global state space and dst specifies a location in the shared state space.”
Is there an instruction that does the reverse (shared memory to global async copy), and if not, why is that the case?
I see there is a cp.async.bulk operation available since sm_90, and it seems to support more generic src and dst memory types. Is that right? (I don’t have a Hopper GPU, so I cannot verify.)