CUDA PTX cp.async only supports global to shared memory copy

user98644 · March 14, 2023, 3:53am

The PTX description of cp.async explicitly says “Operand src specifies a location in the global state space and dst specifies a location in the shared state space.”

Is there an instruction that does the reverse (shared memory to global async copy), and if not, why is that the case?

I see that a cp.async.bulk operation has been available since sm_90, and it seems to support more general src and dst memory types. Is that right? (I don’t have a Hopper GPU, so I cannot verify.)

striker159 · March 14, 2023, 6:45am

sm_80 and newer provide hardware acceleration for copies from global memory to shared memory via cp.async.
The reverse direction is not hardware accelerated, so there is no specific instruction for that case.
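
To make the two directions concrete, here is a minimal sketch (not a definitive implementation). It stages data from global to shared memory with cp.async through inline PTX, then writes it back to global memory with an ordinary store, which passes through a register. The names stage_and_write_back, run_copy_demo and kThreads are made up for this illustration, and the 4-byte copy size and one-element-per-thread layout are assumptions chosen for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // hypothetical block size for this demo

// Hypothetical demo kernel: each thread stages one int in shared memory,
// then the block writes the tile back to global memory in reversed order.
__global__ void stage_and_write_back(const int* in, int* out)
{
    __shared__ int tile[kThreads];
    const int gid = blockIdx.x * kThreads + threadIdx.x;

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // global -> shared: cp.async wants a shared-state-space dst address
    // and a global-state-space src address.
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(&tile[threadIdx.x]));
    size_t src = __cvta_generic_to_global(in + gid);
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
                 :: "r"(dst), "l"(src) : "memory");
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
#else
    // Fallback for older architectures: plain load and store via a register.
    tile[threadIdx.x] = in[gid];
#endif
    __syncthreads();  // make every thread's staged element visible to the block

    // shared -> global: no cp.async form here, so this is an ordinary
    // load from shared memory followed by a store to global memory.
    out[blockIdx.x * kThreads + threadIdx.x] = tile[kThreads - 1 - threadIdx.x];
}

// Runs the kernel and returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int run_copy_demo()
{
    constexpr int kBlocks = 4;
    constexpr int n = kBlocks * kThreads;
    static int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    stage_and_write_back<<<kBlocks, kThreads>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const int block = i / kThreads, t = i % kThreads;
        if (h_out[i] != h_in[block * kThreads + (kThreads - 1 - t)]) ++mismatches;
    }
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", run_copy_demo());
    return 0;
}
```

Building with something like nvcc -arch=sm_80 selects the cp.async path; for older targets the fallback branch is compiled instead. The sketch does not cover the sm_90 cp.async.bulk instruction mentioned in the question.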

