Asynchronous copy of shared memory

Hi, I am reading a blog post about asynchronous copies to shared memory (https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/).

[Image from the blog: journey through the memory hierarchy]

The picture in the blog confuses me: why is L1 involved in the data path? Can we bypass the L1 cache and transfer data directly from L2 to shared memory? Moreover, can we bypass the L2 cache and transfer data directly from global memory to shared memory?

No, you cannot bypass L2. Transfers between global memory and the SMs always pass through the L2 cache.

A load from global memory will typically go through L1. This has been true since the L1 cache was introduced in CUDA GPUs more than 10 years ago.

You can bypass L1 with ordinary load activity (e.g. loads compiled with the cache-global policy). As far as I can tell, the CUDA C++ async-copy mechanism doesn't expose explicit control over this, but it may be possible with PTX: the `cp.async` instruction takes `.ca` (cache at all levels) and `.cg` (cache at L2 only) qualifiers.
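For illustration, here is a minimal sketch (not from the thread; kernel and variable names are made up, and it assumes sm_80+) of both approaches: `__ldcg()` for an ordinary load cached at L2 only, and inline PTX `cp.async.cg` for an asynchronous global-to-shared copy that likewise bypasses L1:

```cuda
#include <cstdint>

// Sketch: two ways to keep global loads out of L1 when filling shared memory.
__global__ void fill_smem(const float4* __restrict__ src) {
    __shared__ float4 a[128];  // filled by ordinary loads
    __shared__ float4 b[128];  // filled by cp.async
    int i = threadIdx.x;

    // Ordinary load path: __ldcg() uses the .cg cache policy (cache in L2
    // only, bypass L1); the data still passes through registers on its way
    // to shared memory.
    a[i] = __ldcg(&src[i]);

    // Async copy path: cp.async.cg copies 16 bytes from global directly to
    // shared memory, caching in L2 only and bypassing both L1 and registers.
    uint32_t smem_dst =
        static_cast<uint32_t>(__cvta_generic_to_shared(&b[i]));
    asm volatile(
        "cp.async.cg.shared.global [%0], [%1], 16;\n"
        "cp.async.commit_group;\n"
        "cp.async.wait_group 0;\n"
        :: "r"(smem_dst), "l"(&src[i]) : "memory");

    __syncthreads();
    // ... use a[] and b[] ...
}
```

Note that `cuda::memcpy_async` in CUDA C++ can lower to `cp.async` when size and alignment allow, but the compiler picks the cache qualifier; the explicit `.ca`/`.cg` choice is only available at the PTX level.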


Thanks!