I’m learning CUDA as well, and one thing that confuses me is that all store requests are counted as L2 HITs. This behavior makes it hard to optimize the reads I actually care about.
Can anybody tell me how to keep ‘L1/TEX store’ from being reported as an L2 HIT? I tried several approaches: ‘cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);’, compiling the kernel with ‘-Xptxas -dlcm=cg’, and ‘--cache-control off’ in ncu. None of them work.
I also tried inline asm with “st.global.cs.f32 [%0], %1;” (sm_89). I expected the streaming store to bypass L1 and L2 entirely, but the L1/TEX store is still reported as a HIT in L2.
Thanks a lot!
SM loads/stores, with the exception of TMA, go through the L1 tag stage. This is the stage that reports hits/misses. Even with cache streaming, a hit/miss will be reported, but an allocation may not occur.
L2 is the point of coherence in NVIDIA GPUs, so all operations through L2 go through the L2 tag stage, which reports a hit or miss. Since L2 is the point of coherence for device memory, there is no guarantee that a STG will ever reach device memory. The same is not true for SYSMEM or PEERMEM.
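If the goal is to look at read behavior in isolation, one option is to collect read-specific L2 counters directly instead of the combined hit rate. This is only a sketch: the `lts__` metric names and the application name below are assumptions, not taken from this thread, so verify them against your Nsight Compute version first.

```shell
# List the read-specific L2 (LTS) sector counters available on this setup
# (metric names are assumptions based on NCU's lts__ counter namespace):
ncu --query-metrics | grep lts__t_sectors_op_read

# Then collect only read sectors and read lookup hits, excluding writes.
# ./my_app is a placeholder for your own binary:
ncu --metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_read_lookup_hit.sum ./my_app
```

Dividing the second counter by the first would give a read-only L2 hit rate, independent of how stores are reported.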
Thanks, Greg. Your explanation reads to me like yes and no at the same time. I still find this behavior, reporting writes as L2 hits, confusing during my optimization work.
Can you give me a simple way to modify this metric to exclude all writes? I haven’t found a correct way to modify metrics so far; it’s too complicated.