Is it possible to combine __pipeline_memcpy_async with __ldcs/__ldg/__ldca?

For example, I have a code like this:

#if __CUDA_ARCH__ >= 800
                __pipeline_wait_prior(0);
                xf = s_xAsync[thread];
                yf = s_yAsync[thread];
                mass = s_mAsync[thread];
                const int nextItem = index + numTotalThreads;
                if (nextItem < numChunks) {
                    __pipeline_memcpy_async(&s_xAsync[thread], &x[nextItem], sizeof(float4));
                    __pipeline_memcpy_async(&s_yAsync[thread], &y[nextItem], sizeof(float4));
                    __pipeline_memcpy_async(&s_mAsync[thread], &m[nextItem], sizeof(float4));
                    __pipeline_commit();
                }
#else
                xf = __ldcs(&x[index]);
                yf = __ldcs(&y[index]);
                mass = __ldcs(&m[index]);
#endif

without __ldcs for normal access, the __pipeline_memcpy_async version is faster than normal access. When I add __ldcs to normal access, they are getting closer in performance.

Can __pipeline_memcpy_async somehow apply __ldg or __ldcs to the global loads for different use-cases like when data doesn’t fit L2 cache or when its in managed memory maybe peer-to-peer access, for even more performance?

You could look at the PTX code to see which instructions are called. I would guess some cp.async variants

It supports cache hint parameters. You can use inline assembly blocks to create your own intrinsics.

1 Like

Then

cp.async.ca ...

would use all cache levels while

cp.async.cg ...

would pass through only L2 cache. But there’s no hint about streaming (or anything like zero hit-ratio expected).

Does “streaming” (zero hit ratio) usage require explicitly defining host-side launch-config with cache hit ratio given by user, always? For example, like this:

cudaStreamAttrValue stream_attribute;                                         
stream_attribute.accessPolicyWindow.hitRatio  = 0.0;                          
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

Does the cache policy “no_allocate" portion of the instruction not apply?

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.