Ampere GPU L2 cache write miss policy

LibAndLab · December 28, 2021, 3:36am

I am trying to understand the ncu metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit and lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss, so I ran some tests on an RTX 3080.
The test function launches only 32 threads:

test1 kernel:

test1 result:

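test2 kernel:
__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *input, float *output, float *temp) {
    // In-place read-modify-write: each thread loads one float and stores
    // the scaled value back to the same address. output and temp are unused here.
    input[threadIdx.x] = input[threadIdx.x] * 1.5f;
}
test2 result: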
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test
==PROF== Connected to process 13248 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling “LtsTSectorsSrcnodeGpcAperture…” - 1: 0%…50%…100% - 1 pass
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test finished
==PROF== Disconnected from process 13248
[13248] test_lts_sector_GPC@127.0.0.1
LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *, float *, float *), 2021-Dec-28 11:10:14, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.avg                       sector                           0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.max                       sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.min                       sector                              0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum                       sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.avg                      sector                           0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.max                      sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.min                      sector                              0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum                      sector                              4
---------------------------------------------------------------------- --------------- ------------------------------

My understanding:
In test2, the kernel reads 32 floats (4 sectors, all load misses), multiplies them by 1.5, and writes them back to the same addresses. Because the data is already cached in L2, the stores hit, so there should be 4 sector store hits. The ncu results match this expectation.
In test1, however, the kernel only reads 32 floats (4 sector load misses) and then writes them to a different location, yet the result is still 4 sector misses and 4 sector hits. This really confuses me: the stores of the 32 floats should miss in L2, so what does the L2 cache do on a write miss? Why does lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit still increase by 4 sectors?
Can anyone help me? Thank you very much!

LibAndLab · February 8, 2022, 7:19am

Can anyone help me?

Robert_Crovella · February 8, 2022, 4:19pm

You might get better responses asking ncu questions on the ncu forum.

LibAndLab · February 9, 2022, 3:29am

OK, thanks for the reply.
