Question about cache metrics

Hi,
According to the cache figure in the manual, I expected the number of L1 misses to equal the number of L2 accesses. With the following metrics and numbers, that is not the case:


l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss.sum    37,554,229
lts__t_sectors_srcunit_tex.sum    40,214,300

Moreover, I expected L2 hits + misses to equal the number of L2 accesses (L1 misses). See these numbers:

lts__t_sectors_srcunit_tex.sum    40,214,300
lts__t_sectors_srcunit_tex_lookup_hit.sum    38,264,032
lts__t_sectors_srcunit_tex_lookup_miss.sum   2,098,020

But 38,264,032 + 2,098,020 = 40,362,052, not 40,214,300.
The difference is small, but I don’t know whether it is acceptable or whether the metrics are incorrect.
Any ideas?

P.S.: In the figure there is a metric named l1tex__m_gnic2l1tex_read_sectors_mem_lg_op_ld.sum, but I don’t see it in the raw output. The device is a 3080 and the Nsight version is 2022.2.

Most likely, the small variations come from the multiple replay passes: each pass collects different metrics, and the passes do not always execute identically. These counts seem to be within that level of variation.
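To put numbers on the gaps being discussed, here is a minimal Python sketch that computes the absolute and relative differences from the counter values posted in the question. The helper name `relative_gap` and the choice to normalize by the larger of the two values are assumptions made for illustration; they are not part of any Nsight tooling.

```python
# Counter values copied from the question above.
l1_global_load_misses = 37_554_229  # l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss.sum
l2_sectors_from_tex = 40_214_300    # lts__t_sectors_srcunit_tex.sum
l2_hits = 38_264_032                # lts__t_sectors_srcunit_tex_lookup_hit.sum
l2_misses = 2_098_020               # lts__t_sectors_srcunit_tex_lookup_miss.sum


def relative_gap(a, b):
    """Return (absolute gap, gap as a percentage of the larger value)."""
    gap = abs(a - b)
    return gap, 100.0 * gap / max(a, b)


# Gap 1: L1 global-load misses vs. L2 sectors requested by the L1TEX unit.
gap1, pct1 = relative_gap(l1_global_load_misses, l2_sectors_from_tex)

# Gap 2: L2 hits + misses vs. L2 sectors requested by the L1TEX unit.
gap2, pct2 = relative_gap(l2_hits + l2_misses, l2_sectors_from_tex)

print(f"L1 misses vs L2 accesses:   {gap1:,} sectors ({pct1:.2f}%)")
print(f"L2 hit+miss vs L2 accesses: {gap2:,} sectors ({pct2:.2f}%)")
```

Comparing the two percentages side by side makes it easier to judge whether each gap is plausibly pass-to-pass variation.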

For the gnic metric, the metric name may have changed. If you hover over the cell in the table, the tooltip tells you which metric is used for that value. You can use the “--query-metrics” flag to list the available metrics, and you can collect them individually with the “--metrics” flag.
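As a rough sketch of how those two flags could be used, the Python snippet below only builds and prints the command lines; it does not run the profiler. The application path `./my_app` and the helper names are hypothetical, and passing several metrics as one comma-separated list is an assumption about how you might choose to invoke `ncu`.

```python
import shlex


def ncu_query_metrics_cmd():
    """Command line that lists the metrics available to the profiler."""
    return ["ncu", "--query-metrics"]


def ncu_collect_cmd(metrics, app_argv):
    """Command line that collects only the named metrics for an application."""
    return ["ncu", "--metrics", ",".join(metrics), *app_argv]


# Metric names taken from the question above.
metrics = [
    "l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss.sum",
    "lts__t_sectors_srcunit_tex.sum",
]

# "./my_app" is a placeholder for the profiled executable.
cmd = ncu_collect_cmd(metrics, ["./my_app"])

print(shlex.join(ncu_query_metrics_cmd()))
print(shlex.join(cmd))
```

Checking a name against the `--query-metrics` listing first should show whether it exists under that spelling on your device and tool version.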

OK thanks. One more question.
In the manual, I see that the --cache-control option takes effect during replays. So what about inter-kernel events? Does the profiler flush the caches at the beginning (or the end) of a kernel? If I set --cache-control none, what happens after K1 and before K2? Assume each kernel needs one pass.

If cache control is enabled, the caches are invalidated before the profiled workload is run. When profiling individual kernels, this means they are invalidated before each profiled kernel. This is the case even if only a single pass is needed for this kernel. If cache control is disabled (none), no explicit cache-related actions are taken, neither before nor after kernel launches, independent of the number of replays needed to collect the data.
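The behavior described above can be illustrated with a toy model. This is not profiler code: the function `profiling_timeline` and its event strings are hypothetical, and invalidating before every replay pass is an assumption based on the manual statement quoted in the question. It simply lists the order of events for K1 and K2 under each setting.

```python
def profiling_timeline(kernels, passes_per_kernel, cache_control="all"):
    """Toy model of the event order when profiling kernels one by one.

    cache_control="all":  caches are invalidated before each pass of each
                          profiled kernel, even if only one pass is needed.
    cache_control="none": no explicit cache action is taken at any point.
    """
    if cache_control not in ("all", "none"):
        raise ValueError("cache_control must be 'all' or 'none'")

    events = []
    for name in kernels:
        for pass_index in range(1, passes_per_kernel + 1):
            if cache_control == "all":
                events.append("invalidate caches")
            events.append(f"run {name} (pass {pass_index})")
    return events


for mode in ("all", "none"):
    print(f"--cache-control {mode}:")
    for event in profiling_timeline(["K1", "K2"], 1, cache_control=mode):
        print("  " + event)
```

In this model, with `none` nothing happens between K1 and K2, so whatever K1 left in the caches is still there when K2 starts.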