I’m learning CUDA as well, and one thing that confuses me is that all store requests are counted as L2 HITs. This behavior makes it hard to optimize the reads I actually care about.
Can anybody tell me how to keep ‘L1/TEX store’ from being reported as an L2 HIT? I tried several approaches: ‘cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);’, compiling the kernel with ‘-Xptxas -dlcm=cg’, and ‘--cache-control off’ in ncu. None of them work.
I also tried inline asm with “st.global.cs.f32 [%0], %1;” (sm_89). I expected the streaming store to bypass L1 and L2 entirely, but the L1/TEX store is still reported as a HIT in L2.
Thanks a lot!
SM loads/stores, with the exception of TMA, go through the L1 tag stage. This is the stage that reports hits/misses. Even with cache streaming, a hit/miss will be reported, but an allocation may not occur.
L2 is the point of coherence in NVIDIA GPUs, so all operations through L2 go through the L2 tag stage, which reports a hit or miss. Since L2 is the point of coherence for device memory, there is no guarantee that a STG will ever reach device memory. The same is not true for SYSMEM or PEERMEM.
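If the goal is to look at read behavior in isolation, one option is to collect read-specific L2 counters directly instead of the combined hit rate. This is only a sketch: the `lts__` metric names and the application name below are assumptions, not taken from this thread, so verify them against your Nsight Compute version first.

```shell
# List the read-specific L2 (LTS) sector counters available on this setup
# (metric names are assumptions based on NCU's lts__ counter namespace):
ncu --query-metrics | grep lts__t_sectors_op_read

# Then collect only read sectors and read lookup hits, excluding writes.
# ./my_app is a placeholder for your own binary:
ncu --metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_read_lookup_hit.sum ./my_app
```

Dividing the second counter by the first would give a read-only L2 hit rate, independent of how stores are reported.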
Thanks, Greg. Your explanation reads to me like yes and no at the same time. I still find this behavior, reporting writes as L2 hits, confusing during my optimization work.
Can you give me a simple way to modify this metric to exclude all writes? I haven’t found a correct way to modify metrics so far; it’s too complicated.