For example, I have code like this:
#if __CUDA_ARCH__ >= 800
    // Wait for all previously committed async copies, then consume
    // the data that was staged into shared memory.
    __pipeline_wait_prior(0);
    xf   = s_xAsync[thread];
    yf   = s_yAsync[thread];
    mass = s_mAsync[thread];

    // Kick off the asynchronous prefetch of the next chunk.
    const int nextItem = index + numTotalThreads;
    if (nextItem < numChunks) {
        __pipeline_memcpy_async(&s_xAsync[thread], &x[nextItem], sizeof(float4));
        __pipeline_memcpy_async(&s_yAsync[thread], &y[nextItem], sizeof(float4));
        __pipeline_memcpy_async(&s_mAsync[thread], &m[nextItem], sizeof(float4));
        __pipeline_commit();
    }
#else
    // Pre-Ampere path: plain global loads with a streaming cache hint.
    xf   = __ldcs(&x[index]);
    yf   = __ldcs(&y[index]);
    mass = __ldcs(&m[index]);
#endif
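For context, the surrounding kernel is shaped roughly like this (a minimal sketch: the kernel name, BLOCK_SIZE, and the float4 element type are my assumptions, filled in only to make the fragment self-contained):

    #include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior

    #define BLOCK_SIZE 256      // assumed block size

    __global__ void bodyForces(const float4* __restrict__ x,
                               const float4* __restrict__ y,
                               const float4* __restrict__ m,
                               int numChunks)
    {
        // One shared-memory staging slot per thread for each prefetched array.
        __shared__ float4 s_xAsync[BLOCK_SIZE];
        __shared__ float4 s_yAsync[BLOCK_SIZE];
        __shared__ float4 s_mAsync[BLOCK_SIZE];

        const int thread          = threadIdx.x;
        const int numTotalThreads = gridDim.x * blockDim.x;

        float4 xf, yf, mass;
        // An initial prefetch is issued before the loop so the first
        // __pipeline_wait_prior(0) has something to wait on; the fragment
        // above then runs once per chunk inside a grid-stride loop.
        for (int index = blockIdx.x * blockDim.x + threadIdx.x;
             index < numChunks; index += numTotalThreads) {
            // ... fragment from above ...
        }
    }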
Without __ldcs on the plain loads, the __pipeline_memcpy_async version is clearly faster. Once I add __ldcs to the plain loads, the two versions get much closer in performance.
Can __pipeline_memcpy_async somehow apply __ldg- or __ldcs-style cache hints to its global loads for different use cases, e.g. when the data doesn't fit in the L2 cache, when it lives in managed memory, or perhaps for peer-to-peer access, to gain even more performance?
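To make the question concrete, what I'm imagining is something like the following, which to my knowledge does not exist; the function name and the hint argument are made up purely to illustrate the idea:

    // HYPOTHETICAL, not a real CUDA API: an async copy that carries a cache hint,
    // e.g. a streaming (__ldcs-like) policy for data that won't fit in or be
    // reused from L2, or a read-only (__ldg-like) policy for immutable data.
    __pipeline_memcpy_async_hinted(&s_xAsync[thread], &x[nextItem],
                                   sizeof(float4),
                                   /* made-up hint enum: */ PIPELINE_HINT_STREAMING);
    __pipeline_commit();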