Why are L2 to L1 and L2 to shared memory different?

137095576 · February 6, 2023, 9:26am

I profiled an FC (fully connected) op and found that the L2 → L1 and L2 → shared memory transfers are slow.
But why is L2 → L1 slower?
Does the data go L2 → L1 → shared memory? But L1 → shared memory shows 0 B.

jmarusarz · February 14, 2023, 8:37pm

Can you share a Nsight Compute result with some more details? It’s not clear to me exactly what you’re seeing and how to clarify it. Thanks.

137095576 · February 15, 2023, 1:41am

I profiled a fully connected layer on an A100.
The results above are for fp16 and fp32, respectively.
In fp16, there is a line from L2 to shared memory showing 191.66 B, but in fp32 this line is always 0 B and all the data goes to L1.
As I understand it, L1 and shared memory are the same piece of hardware. So when does data go to L1, and when does it go to shared memory?
Thanks.

jmarusarz · February 16, 2023, 10:18pm

The L-shaped path from L2 to Shared Memory represents the LDGSTS instruction path, i.e. “Asynchronous Global to Shared Memcopy”. It’s hard to say why fp16 is using this path and fp32 is not. It could be that the compiler generated instructions to use this path in one case and not the other. Or if you’re using library code, maybe it was implemented that way. You should be able to look at the SASS assembly in Nsight Compute and see the difference in the SASS instructions. But why code was generated that way is difficult to say with only this information.
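To make the LDGSTS distinction concrete, here is a minimal sketch, assuming a hypothetical pair of kernels rather than the poster's FC layer. Both stage a tile of a global array in shared memory. The first uses a plain assignment, which normally compiles to an `LDG` (global load into a register, through L1) followed by an `STS` (register to shared memory store). The second uses `__pipeline_memcpy_async` from `<cuda_pipeline.h>`, which on sm_80 (A100) and newer is expected to compile to `LDGSTS`, copying global data into shared memory without a register round trip. The names `kTile`, `stage_via_registers`, `stage_via_ldgsts`, and `run_both_copies` are illustrative only.

```cuda
#include <cuda_pipeline.h>  // __pipeline_memcpy_async, __pipeline_commit, __pipeline_wait_prior
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 256;  // hypothetical tile size: one float per thread

// Conventional staging: global -> register (LDG, via L1), then register -> shared (STS).
__global__ void stage_via_registers(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[kTile];
    int i = blockIdx.x * kTile + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

// Asynchronous staging: on sm_80+ this is expected to compile to LDGSTS.
// On older architectures the primitive falls back to a synchronous copy.
__global__ void stage_via_ldgsts(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[kTile];
    int i = blockIdx.x * kTile + threadIdx.x;
    if (i < n) {
        // Copy one float (4 bytes); the size must be 4, 8, or 16 bytes.
        __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float));
    }
    __pipeline_commit();       // close this thread's batch of async copies
    __pipeline_wait_prior(0);  // wait until this thread's copies have landed
    __syncthreads();           // make every thread's copy visible to the block
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

// Runs both kernels on the same input; returns true if both match the CPU result.
// Error checking is omitted for brevity; without a GPU the comparison fails.
bool run_both_copies(int n) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int k = 0; k < n; ++k) h_in[k] = static_cast<float>(k);

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + kTile - 1) / kTile;
    stage_via_registers<<<blocks, kTile>>>(d_in, d_a, n);
    stage_via_ldgsts<<<blocks, kTile>>>(d_in, d_b, n);

    // cudaMemcpy device-to-host waits for the preceding kernels to finish.
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);

    for (int k = 0; k < n; ++k)
        if (h_a[k] != 2.0f * h_in[k] || h_b[k] != h_a[k]) return false;
    return true;
}
```

Compiling with `nvcc -arch=sm_80` and dumping the SASS with `cuobjdump -sass` (or opening the Source page in Nsight Compute) should show `LDG`/`STS` pairs in the first kernel and `LDGSTS` in the second. Comparing the SASS of the fp16 and fp32 kernels the same way is one way to check whether only one of them uses the asynchronous copy path.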
