For example, I have code like this:
#if __CUDA_ARCH__ >= 800
    // Wait for all previously committed async copies, then consume
    // the data that was staged into shared memory.
    __pipeline_wait_prior(0);
    xf   = s_xAsync[thread];
    yf   = s_yAsync[thread];
    mass = s_mAsync[thread];

    // Kick off the asynchronous prefetch of the next chunk.
    const int nextItem = index + numTotalThreads;
    if (nextItem < numChunks) {
        __pipeline_memcpy_async(&s_xAsync[thread], &x[nextItem], sizeof(float4));
        __pipeline_memcpy_async(&s_yAsync[thread], &y[nextItem], sizeof(float4));
        __pipeline_memcpy_async(&s_mAsync[thread], &m[nextItem], sizeof(float4));
        __pipeline_commit();
    }
#else
    // Pre-Ampere path: plain global loads with a streaming cache hint.
    xf   = __ldcs(&x[index]);
    yf   = __ldcs(&y[index]);
    mass = __ldcs(&m[index]);
#endif
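For context, the surrounding kernel is shaped roughly like this (a minimal sketch: the kernel name, BLOCK_SIZE, and the float4 element type are my assumptions, filled in only to make the fragment self-contained):

    #include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior

    #define BLOCK_SIZE 256      // assumed block size

    __global__ void bodyForces(const float4* __restrict__ x,
                               const float4* __restrict__ y,
                               const float4* __restrict__ m,
                               int numChunks)
    {
        // One shared-memory staging slot per thread for each prefetched array.
        __shared__ float4 s_xAsync[BLOCK_SIZE];
        __shared__ float4 s_yAsync[BLOCK_SIZE];
        __shared__ float4 s_mAsync[BLOCK_SIZE];

        const int thread          = threadIdx.x;
        const int numTotalThreads = gridDim.x * blockDim.x;

        float4 xf, yf, mass;
        // An initial prefetch is issued before the loop so the first
        // __pipeline_wait_prior(0) has something to wait on; the fragment
        // above then runs once per chunk inside a grid-stride loop.
        for (int index = blockIdx.x * blockDim.x + threadIdx.x;
             index < numChunks; index += numTotalThreads) {
            // ... fragment from above ...
        }
    }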
Without __ldcs on the plain loads, the __pipeline_memcpy_async version is clearly faster. Once I add __ldcs to the plain loads, the two versions get much closer in performance.
Can __pipeline_memcpy_async somehow apply __ldg- or __ldcs-style cache hints to its global loads for different use cases, e.g. when the data doesn't fit in the L2 cache, when it lives in managed memory, or perhaps for peer-to-peer access, to gain even more performance?
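To make the question concrete, what I'm imagining is something like the following, which to my knowledge does not exist; the function name and the hint argument are made up purely to illustrate the idea:

    // HYPOTHETICAL, not a real CUDA API: an async copy that carries a cache hint,
    // e.g. a streaming (__ldcs-like) policy for data that won't fit in or be
    // reused from L2, or a read-only (__ldg-like) policy for immutable data.
    __pipeline_memcpy_async_hinted(&s_xAsync[thread], &x[nextItem],
                                   sizeof(float4),
                                   /* made-up hint enum: */ PIPELINE_HINT_STREAMING);
    __pipeline_commit();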