Can I use PTX for async memcpy?

202476410arsmart · October 16, 2023, 4:50pm

I saw this code somewhere. I am using the Ampere architecture, and I am not sure whether it avoids using registers when moving data from global to shared memory. Thanks!

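#include <cstdint>

// Asynchronously copies 4 bytes from global to shared memory (sm_80 and newer).
// smem_addr must be a shared-state-space address, e.g. obtained with
// __cvta_generic_to_shared(), not a generic pointer. The data is only safe to
// read after cp.async.commit_group and a matching cp.async.wait_group.
__device__ __forceinline__
void ldgsts32(const uint32_t &smem_addr,
              const void *gmem_ptr) {
    asm volatile (
// The L2::128B prefetch hint needs CUDA 11.4 or newer (including 12.x).
#if (__CUDACC_VER_MAJOR__ > 11) || \
    (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 4)
        " cp.async.ca.shared.global.L2::128B [%0], [%1], 4;\n"
#else
        " cp.async.ca.shared.global [%0], [%1], 4;\n"
#endif
        : : "r"(smem_addr), "l"(gmem_ptr)
        : "memory"
    );
}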

The reason I am considering this instead of memcpy_async is that I need strided reads, but memcpy_async seems to require a contiguous block of data…
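To make the strided use case concrete, here is a minimal sketch of how a per-element cp.async could gather strided values into shared memory. The kernel name `strided_gather`, the helper `cp_async_4`, the tile size, and the stride of 3 are all assumptions made for illustration and are not from the original post. It assumes compilation for sm_80 or newer (for example `nvcc -arch=sm_80`).

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // hypothetical block size, one element per thread

// Issue one 4-byte asynchronous global->shared copy (requires sm_80 or newer).
__device__ __forceinline__
void cp_async_4(uint32_t smem_addr, const void *gmem_ptr) {
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
                 : : "r"(smem_addr), "l"(gmem_ptr) : "memory");
}

// Hypothetical kernel: thread t copies in[t * stride] into tile[t], then the
// tile is written out densely so the host can verify it.
__global__ void strided_gather(const float *in, float *out, int stride) {
    __shared__ float tile[kThreads];
    const int tid = threadIdx.x;

    // cp.async takes a shared-state-space address, not a generic pointer.
    const uint32_t dst =
        static_cast<uint32_t>(__cvta_generic_to_shared(&tile[tid]));
    cp_async_4(dst, in + static_cast<size_t>(tid) * stride);

    // Close the group of outstanding copies and wait for all groups to finish.
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
    __syncthreads();  // required before reading elements copied by other threads

    out[tid] = tile[tid];
}

int main() {
    const int stride = 3;
    std::vector<float> h_in(kThreads * stride), h_out(kThreads, -1.0f);
    for (size_t i = 0; i < h_in.size(); ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    strided_gather<<<1, kThreads>>>(d_in, d_out, stride);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Every output element should equal the strided input element.
    int mismatches = 0;
    for (int t = 0; t < kThreads; ++t)
        if (h_out[t] != h_in[t * stride]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

Because each thread issues its own 4-byte copy with its own source address, the source does not have to be contiguous. Whether this is faster than a plain load and store likely depends on the access pattern and the architecture.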

Robert_Crovella · October 16, 2023, 4:59pm

this may be of interest

202476410arsmart · October 17, 2023, 12:12am

Hmm… So does this code benefit from hardware acceleration? (That is, does it avoid using registers to move the data?)

Robert_Crovella · October 17, 2023, 4:05am

The level of benefit may depend to some degree on the architecture. I haven’t really studied the PTX guide for these details, but the programming guide and, if memory serves, the A100 whitepaper give some description of the benefits, at least with respect to the CUDA C++ version/intrinsics.

I think a very basic perusal of the generated SASS will identify what the register usage is. AFAIK, for cc 8.0 and beyond, it should not require registers to store the data in-flight from global to shared. There are registers used, of course, to indicate addresses and so forth.
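For comparison with the inline PTX, the following is a sketch of the same kind of strided per-element copy written with the CUDA C++ pipeline primitives from `<cuda_pipeline.h>`. The kernel name, block size, and stride are hypothetical. One way to check register usage would be to compile for sm_80 and dump the SASS (for example with `cuobjdump -sass`), then look for an `LDGSTS` instruction; that the compiler emits it for this exact code is an expectation, not something verified here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior

constexpr int kThreads = 64;  // hypothetical block size

// Hypothetical kernel: thread t asynchronously copies in[t * stride] into
// shared memory, then reads an element that a different thread copied.
__global__ void strided_gather_pipeline(const int *in, int *out, int stride) {
    __shared__ int tile[kThreads];
    const int tid = threadIdx.x;

    // The size argument must be 4, 8, or 16 bytes; sizeof(int) is 4 here.
    __pipeline_memcpy_async(&tile[tid], in + tid * stride, sizeof(int));
    __pipeline_commit();       // close the current batch of async copies
    __pipeline_wait_prior(0);  // wait until no committed batch is pending
    __syncthreads();           // make all threads' copies visible block-wide

    out[tid] = tile[kThreads - 1 - tid];
}

int main() {
    const int stride = 5;
    std::vector<int> h_in(kThreads * stride), h_out(kThreads, -1);
    for (size_t i = 0; i < h_in.size(); ++i) h_in[i] = static_cast<int>(i);

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, h_out.size() * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    strided_gather_pipeline<<<1, kThreads>>>(d_in, d_out, stride);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
               cudaMemcpyDeviceToHost);

    // out[t] should hold the strided element gathered by thread kThreads-1-t.
    int mismatches = 0;
    for (int t = 0; t < kThreads; ++t)
        if (h_out[t] != h_in[(kThreads - 1 - t) * stride]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

The primitives avoid hand-written PTX and compiler-version checks, at the cost of less direct control over qualifiers such as the L2 prefetch hint.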

202476410arsmart · October 17, 2023, 6:10am

Haha, frankly speaking, I am not skilled enough to read SASS, so I will take it that this saves registers. Thanks!
