Asynchronous copy of shared memory

Hi, I am reading a blog post about asynchronous copies to shared memory (https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/).

[Image from the blog: journey through the memory hierarchy]

The picture in the blog confuses me: why is L1 involved in the data path? Can we bypass the L1 cache and transfer data directly from L2 to shared memory? Moreover, can we bypass the L2 cache and transfer data directly from global memory to shared memory?

No, you cannot bypass L2. Transfers between global memory and the SMs always pass through the L2 cache.

A load from global memory will typically go through L1. This has been true since the L1 cache was introduced in CUDA GPUs more than 10 years ago.

You can bypass L1 with ordinary load activity (e.g. loads compiled with the cache-global policy). As far as I can tell, the CUDA C++ async-copy mechanism doesn't expose explicit control over this, but it may be possible with PTX: the `cp.async` instruction takes `.ca` (cache at all levels) and `.cg` (cache at L2 only) qualifiers.
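For illustration, here is a minimal sketch (not from the thread; kernel and variable names are made up, and it assumes sm_80+) of both approaches: `__ldcg()` for an ordinary load cached at L2 only, and inline PTX `cp.async.cg` for an asynchronous global-to-shared copy that likewise bypasses L1:

```cuda
#include <cstdint>

// Sketch: two ways to keep global loads out of L1 when filling shared memory.
__global__ void fill_smem(const float4* __restrict__ src) {
    __shared__ float4 a[128];  // filled by ordinary loads
    __shared__ float4 b[128];  // filled by cp.async
    int i = threadIdx.x;

    // Ordinary load path: __ldcg() uses the .cg cache policy (cache in L2
    // only, bypass L1); the data still passes through registers on its way
    // to shared memory.
    a[i] = __ldcg(&src[i]);

    // Async copy path: cp.async.cg copies 16 bytes from global directly to
    // shared memory, caching in L2 only and bypassing both L1 and registers.
    uint32_t smem_dst =
        static_cast<uint32_t>(__cvta_generic_to_shared(&b[i]));
    asm volatile(
        "cp.async.cg.shared.global [%0], [%1], 16;\n"
        "cp.async.commit_group;\n"
        "cp.async.wait_group 0;\n"
        :: "r"(smem_dst), "l"(&src[i]) : "memory");

    __syncthreads();
    // ... use a[] and b[] ...
}
```

Note that `cuda::memcpy_async` in CUDA C++ can lower to `cp.async` when size and alignment allow, but the compiler picks the cache qualifier; the explicit `.ca`/`.cg` choice is only available at the PTX level.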


Thanks!