Hi, I am reading a blog post about asynchronous copy to shared memory (https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/?_gl=1*1plcfms*_gcl_au*MjExOTUxNjc5LjE3MTE1MjA0ODA.).
The diagram in the post confuses me: why is the L1 cache involved in the data path? Can we bypass the L1 cache and transfer data directly from L2 to shared memory? Moreover, can we bypass the L2 cache and transfer data directly from global memory to shared memory?