L2 hit rate more than 100%

Hi
I see a topic about L2 hit rates larger than 100% and I see such numbers in my analysis too. The numbers look like

lts__t_sector_hit_rate.pct     130%
launch__thread_count           5120
launch__grid_size              10
launch__block_size             512
gpc__cycles_elapsed.avg        20097.333333
gpu__time_duration.sum         0.014016 ms
smsp__inst_executed.sum        304950
profiler__replayer_passes      3

From the previous topic, I understand that for short kernels (short duration or small grid size) measured over multiple passes, the hit rate may be inaccurate. What I specifically want to know: can I assume the single-pass hit rate is 43.3% because the 3-pass value is 130%?

No, you cannot make that assumption. The passes collect different metrics. For example, 1000 accesses and 940 hits would give you a 94% hit rate. But if accesses and hits are collected in different passes, and the run is so short that it does not saturate the GPU or reach steady state (as described in the other thread), the ratio can be skewed. The reported rate is not simply divisible by the number of passes.
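To make this concrete, here is a minimal sketch in Python, using hypothetical counts (not values from your run), of how a numerator and denominator taken from different replays of a short, non-steady-state kernel can produce a rate above 100%:

    # Pass A collects the denominator: total L2 sector lookups seen in that replay.
    sectors_pass_a = 1000
    # Pass B collects the numerator in a *different* replay, where different
    # cache state or extra background traffic inflated the hit count.
    hits_pass_b = 1300

    hit_rate_pct = hits_pass_b / sectors_pass_a * 100
    print(f"{hit_rate_pct:.1f}%")  # 130.0% -- not 3x some well-defined per-pass rate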

Some NVIDIA GPUs can collect lts__t_sector_hit_rate.pct in a single pass. The GPU you are running on required 2 passes to collect the metrics, plus 1 pass to optimize the data restore.

lts__t_sector_hit_rate = (lts__t_sectors_lookup_hit.avg) / (lts__t_sectors.avg) * 100.

The reason that multi-pass is problematic is that the two metrics can be collected in different passes.

The NCU option --cache-control defaults to all (caches are flushed and invalidated between passes). This helps, but for small kernels there can still be many races that cause incorrect results.

The metric listed above covers all L2 accesses; it does not filter to traffic from the SM L1TEX unit. As a result, one pass may pick up traffic from other independent engines or units, such as copy engines, display, the instruction cache, and the MMU. These can drive small differences in the counters that, for a small grid launch, can result in out-of-bounds or inconsistent values.

You may get more deterministic results by calculating your own hit rate as

CALCULATED_lts__t_sectors_srcunit_tex_lookup_hit_rate.pct = lts__t_sectors_srcunit_tex_lookup_hit.sum / (lts__t_sectors_srcunit_tex_lookup_hit.sum + lts__t_sectors_srcunit_tex_lookup_miss.sum) * 100.
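As an illustration only (the command line and the numbers below are placeholders, not output from your run), the two sums can be requested with ncu's --metrics option and combined in a few lines of Python:

    # Collect the two counters, e.g.:
    #   ncu --cache-control all \
    #       --metrics lts__t_sectors_srcunit_tex_lookup_hit.sum,lts__t_sectors_srcunit_tex_lookup_miss.sum \
    #       ./my_app
    # then compute the rate from the reported sums.

    def srcunit_tex_hit_rate_pct(hit_sum: int, miss_sum: int) -> float:
        """Hit rate over L2 sector lookups sourced from SM L1TEX, in percent."""
        total = hit_sum + miss_sum
        if total == 0:
            return 0.0  # no L1TEX-sourced traffic observed
        return hit_sum / total * 100.0

    print(srcunit_tex_hit_rate_pct(hit_sum=940, miss_sum=60))  # 94.0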


If I understand correctly, you are saying that counting hits and misses yourself, filtered to L1TEX, is a better idea. But even

lts__t_sectors_srcunit_tex_lookup_hit.sum
lts__t_sectors_srcunit_tex_lookup_miss.sum

require 3 passes. Isn't that the same kind of problem as with the hit rate metric?

Using srcunit_tex will remove some temporal data accesses from other engines, such as the copy engine, and may remove inconsistent MMU accesses. This is still not 100% accurate on systems that cannot collect both hit and miss in the same pass.

“require 3 passes”
If metrics take more than 1 pass, NCU adds an additional pass to try to reduce the cost of state restore:
Pass 1 - NCU optimizes what data needs to be restored
Pass 2 - collect hit
Pass 3 - collect miss

Some GPUs support collection of hit and miss in the same pass.

For many metrics, you need to saturate the GPU (a long enough, large enough launch) in order to resolve the multi-pass issue.
