Can I use PTX for async memcpy?

I saw this somewhere. I am using the Ampere architecture and am not sure whether this avoids using registers when moving data from global to shared memory. Thanks!

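#include <cstdint>

// Asynchronously copy 4 bytes from global to shared memory (cp.async, sm_80+).
// smem_addr must be a 32-bit shared-space address, not a generic pointer.
__device__ __forceinline__
void ldgsts32(const uint32_t &smem_addr,
              const void *gmem_ptr) {
    asm volatile (
// Test major and minor together: the original check
// (major >= 11 && minor >= 4) is false for CUDA 12.0 through 12.3.
#if (__CUDACC_VER_MAJOR__ > 11) || \
    (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 4)
        " cp.async.ca.shared.global.L2::128B [%0], [%1], 4;\n"
#else
        " cp.async.ca.shared.global [%0], [%1], 4;\n"
#endif
        : : "r"(smem_addr), "l"(gmem_ptr)
    );
}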

The reason I am considering this rather than memcpy_async is that I need to read strided data, but memcpy_async seems to require a contiguous data block…
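For reference, here is a minimal sketch of how a strided read with cp.async might look, assuming one 4-byte element per thread and a single block. The names cp_async_4, strided_gather, run_strided_gather_check and the TILE size are made up for illustration. Note that the copy is only guaranteed to have landed after a commit_group/wait_group pair, and other threads only see it after __syncthreads().

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 128;  // elements staged per block (illustrative choice)

// Issue one 4-byte asynchronous global->shared copy (needs sm_80 or newer).
__device__ __forceinline__ void cp_async_4(void *smem_ptr, const void *gmem_ptr) {
    // cp.async expects a shared-space address, so convert the generic pointer.
    uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
                 : : "r"(smem_addr), "l"(gmem_ptr) : "memory");
}

__global__ void strided_gather(const float *in, float *out, int stride) {
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    // Each thread copies one element; source addresses are `stride` floats apart.
    cp_async_4(&tile[t], in + static_cast<size_t>(t) * stride);
    // Close the group of outstanding copies and wait for all of them.
    asm volatile("cp.async.commit_group;\n" ::: "memory");
    asm volatile("cp.async.wait_group 0;\n" ::: "memory");
    __syncthreads();  // make every thread's copy visible to the whole block
    // Read an element written by another thread to exercise the shared tile.
    out[t] = tile[TILE - 1 - t];
}

// Host-side check: launches one block and compares against a CPU gather.
bool run_strided_gather_check() {
    const int stride = 3;  // arbitrary stride for the test
    std::vector<float> h_in(TILE * stride), h_out(TILE, -1.0f);
    for (int i = 0; i < TILE * stride; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    strided_gather<<<1, TILE>>>(d_in, d_out, stride);
    // This blocking copy also waits for the kernel and reports launch errors.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    bool ok = (err == cudaSuccess);
    for (int t = 0; ok && t < TILE; ++t)
        ok = (h_out[t] == h_in[(TILE - 1 - t) * stride]);
    return ok;
}

int main() {
    bool ok = run_strided_gather_check();
    std::printf("%s\n", ok ? "strided gather matches" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

This would need to be built for cc 8.0 or newer, e.g. with nvcc -arch=sm_80. A real kernel would normally issue several copies per thread before committing, so that the loads overlap with other work; the immediate wait here is only to keep the sketch short.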

this may be of interest

Hmm… so can this code benefit from hardware acceleration? (That is, avoid using registers to move the data?)

The level of benefit may depend to some degree on the architecture. I haven’t really studied the PTX guide for these factoids, but the programming manual and, if memory serves, the A100 whitepaper give some description of the benefits, at least with respect to the CUDA C++ version/intrinsics.

I think a very basic perusal of the generated SASS will identify what the register usage is. AFAIK, for cc 8.0 and beyond, it should not require registers to store the data in-flight from global to shared. There are registers used, of course, to indicate addresses and so forth.
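As a rough way to do that comparison, the sketch below puts a plain copy next to the CUDA C++ pipeline primitive version so the two can be disassembled side by side. The kernel names, the N constant, run_copy_compare and the file name copy.cu are placeholders for illustration. My expectation (not verified here) is that on sm_80 the first kernel shows separate LDG and STS instructions, while the second shows LDGSTS.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>

constexpr int N = 128;  // threads per block and tile size (illustrative)

// Plain copy: the value normally passes through a register on its way
// from global to shared memory.
__global__ void copy_via_register(const float *in, float *out, int stride) {
    __shared__ float tile[N];
    int t = threadIdx.x;
    tile[t] = in[t * stride];
    __syncthreads();
    out[t] = tile[N - 1 - t];
}

// Asynchronous copy using the pipeline primitives from <cuda_pipeline.h>.
__global__ void copy_async(const float *in, float *out, int stride) {
    __shared__ float tile[N];
    int t = threadIdx.x;
    __pipeline_memcpy_async(&tile[t], &in[t * stride], sizeof(float));
    __pipeline_commit();       // close the current batch of async copies
    __pipeline_wait_prior(0);  // wait until no committed batch is pending
    __syncthreads();
    out[t] = tile[N - 1 - t];
}

// To inspect the SASS (copy.cu is a placeholder file name):
//   nvcc -arch=sm_80 -cubin -o copy.cubin copy.cu
//   cuobjdump -sass copy.cubin

// Host-side sanity check that both kernels produce the same result.
bool run_copy_compare() {
    const int stride = 2;  // arbitrary stride for the test
    std::vector<float> h_in(N * stride), h_a(N, -1.0f), h_b(N, -2.0f);
    for (int i = 0; i < N * stride; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, h_in.size() * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_a, N * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_b, N * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
                   cudaMemcpyHostToDevice);
        copy_via_register<<<1, N>>>(d_in, d_a, stride);
        copy_async<<<1, N>>>(d_in, d_b, stride);
        // Blocking copies wait for the kernels and surface any launch error.
        ok = cudaMemcpy(h_a.data(), d_a, N * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_b.data(), d_b, N * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);  // cudaFree accepts a null pointer
    cudaFree(d_a);
    cudaFree(d_b);

    for (int t = 0; ok && t < N; ++t)
        ok = (h_a[t] == h_b[t]) && (h_a[t] == h_in[(N - 1 - t) * stride]);
    return ok;
}

int main() {
    bool ok = run_copy_compare();
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Registers will still appear in both listings for the addresses and the index arithmetic; the thing to look for is whether the loaded value itself is held in a register between the load and the store.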

Haha, frankly speaking, I am not skilled enough to read SASS, so I will assume it saves registers. Thanks!