From the previous topic, I understand that for short kernels (short duration or small grid size) collected over multiple passes, the hit rate may be inaccurate. But specifically, I want to know: can I assume the 1-pass hit rate is 43.3% because the 3-pass value is 130%?
No, you cannot make that assumption. The passes collect different metrics. For example, 1000 accesses and 940 hits would give you a 94% hit rate. But if accesses and hits are collected in different passes, and the run is so short that it doesn't saturate the GPU or reach steady state (as described in the other thread), the ratio can get skewed. The reported value isn't simply divisible by the number of passes.
The reason that multi-pass is problematic is that the two metrics can be collected in different passes.
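To see how a cross-pass mismatch can push the ratio out of bounds, here is a small numeric sketch (the counter values are invented for illustration):

```python
# Hits and accesses measured in the SAME pass: the ratio is well defined.
accesses, hits = 1000, 940
print(f"same-pass hit rate: {hits / accesses:.1%}")  # 94.0%

# Hits and accesses measured in DIFFERENT replay passes of a short,
# non-steady-state kernel: each pass can see different background
# traffic, so the computed ratio can exceed 100%.
accesses_pass1 = 300   # pass counting accesses saw little extra traffic
hits_pass2 = 390       # pass counting hits saw additional L2 activity
print(f"cross-pass hit rate: {hits_pass2 / accesses_pass1:.1%}")  # 130.0%
```

The point is that neither per-pass count is "wrong" in isolation; the ratio is only meaningful when both counters observe the same traffic.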
NCU's `--cache-control` option defaults to `all` (caches are flushed and invalidated before each pass). This helps, but for small kernels there can still be many race conditions between passes that may cause incorrect results.
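For reference, the default cache behavior can be requested explicitly on the command line (a sketch; the application name is a placeholder):

```shell
# Flush and invalidate GPU caches before each replay pass (the default).
# ./my_app stands in for your actual application binary.
ncu --cache-control all ./my_app
```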
The metric listed above covers all L2 accesses; it does not filter to traffic from the SM L1TEX unit. As a result, a pass may also count data traffic from other independent engines or units such as copy engines, display, the instruction cache, and the MMU. These can drive small differences between passes that, for a small grid launch, result in out-of-bounds or inconsistent values.
You may get more deterministic results by calculating your own hit rate from the srcunit_tex hit and miss counters. Filtering on srcunit_tex removes some data accesses from other engines, such as the copy engine, and may remove inconsistent MMU accesses. This is still not 100% accurate on systems that cannot collect both hit and miss in the same pass.
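Assuming hit and miss counters filtered on srcunit_tex (the metric names in the comment are my assumption; verify them with `ncu --query-metrics` on your GPU), the hit rate is simply hits / (hits + misses):

```python
# Hypothetical values read from two counters, e.g. metrics along the
# lines of lts__t_sectors_srcunit_tex_lookup_hit.sum and
# lts__t_sectors_srcunit_tex_lookup_miss.sum -- confirm the exact
# names with `ncu --query-metrics` for your chip.
lookup_hit = 940_000
lookup_miss = 60_000

hit_rate_pct = 100.0 * lookup_hit / (lookup_hit + lookup_miss)
print(f"L2 hit rate (SM/TEX traffic only): {hit_rate_pct:.1f}%")  # 94.0%
```

Because the denominator is built from the same filtered counters as the numerator, the result stays in the 0-100% range as long as both are collected consistently.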
“require 3 passes”
If the requested metrics take more than one pass, NCU adds an additional pass to try to reduce the cost of state save and restore.
Pass 1 - NCU optimizes what data needs to be restored
Pass 2 - collect hit
Pass 3 - collect miss
Some GPUs support collection of hit and miss in the same pass.
For many metrics, you need to saturate the GPU (launch enough work to reach steady state) in order to avoid the multi-pass inconsistency.