How to profile L1 and L2 hit ratios on Tesla C2050 cards using the command-line profiler?


I need to profile the cache hit ratios to see the effect of some optimizations. How can I do that using the command-line profiler? I prefer the command-line profiler because I need to profile a large number of runs. It seems the CUDA command-line profiler does not recognize “l1_cache_global_hit_rate” and “l2_l1_read_hit_rate” in the configuration file.


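The CUDA command-line profiler only supports collection of raw counters. A hit rate is a metric, which is computed from counters, so those names are not valid in the configuration file. nvprof (which ships with CUDA 5.0 and above) supports capturing metrics directly.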

The following write-up is based on a quick glance at the output of nvprof --query-events. I did not test the results, so I would recommend that you run nvprof or the Visual Profiler on one kernel and compare its metrics against the values computed below. These directions are for GF100 (the GPU in the Tesla C2050) only.

For the CUDA Command-Line Profiler

L1 Hit Rate

  • Add the counters l1_global_load_hit and l1_global_load_miss to the config file.
  • This will not include uncached global loads, global stores, or atomics.

l1_cache_global_hit_rate = l1_global_load_hit / (l1_global_load_hit + l1_global_load_miss)
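
As a rough illustration of the formula above, here is a minimal Python sketch that computes the L1 hit rate per kernel from a profiler log. It assumes the log was written in CSV form (COMPUTE_PROFILE_CSV=1), that comment lines start with '#', and that the header row uses the counter names as column names. The sample log, its values, and the kernel name my_kernel are made up for illustration, so check the layout against your own log before relying on it.

```python
import csv
import io


def l1_global_hit_rates(log_text):
    """Return a list of (kernel_name, hit_rate) for rows that have both L1 counters."""
    # Drop blank lines and the '#' comment lines assumed to precede the CSV header.
    lines = [ln for ln in log_text.splitlines() if ln.strip() and not ln.startswith("#")]
    results = []
    for row in csv.DictReader(io.StringIO("\n".join(lines))):
        hit = row.get("l1_global_load_hit")
        miss = row.get("l1_global_load_miss")
        if not hit or not miss:
            continue  # rows without counter values (e.g. memcpy entries) are skipped
        hit, miss = int(hit), int(miss)
        total = hit + miss
        # A kernel with no cached global loads has no defined hit rate.
        results.append((row["method"], hit / total if total else None))
    return results


# Hypothetical log contents; the real column set depends on your config file.
SAMPLE_LOG = """\
# CUDA_PROFILE_LOG_VERSION 2.0
method,gputime,cputime,occupancy,l1_global_load_hit,l1_global_load_miss
memcpyHtoD,10.0,20.0
my_kernel,100.0,120.0,0.667,300,100
"""

for name, rate in l1_global_hit_rates(SAMPLE_LOG):
    print(name, rate)
```

Because the function takes the log as text, it can be run over the logs of many profiling runs in a loop, which fits the batch use case in the question.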

L2 L1 Read Hit Rate

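  • Add the counters l2_subp0_read_hit_sectors and l2_subp0_read_sector_queries to the config file.
  • This cannot be collected for both subp0 and subp1 in the same pass. The hit rate for the two sub-partitions is usually very consistent, so measuring subp0 alone is normally sufficient.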

l2_l1_read_hit_rate = l2_subp0_read_hit_sectors / l2_subp0_read_sector_queries
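
The same arithmetic can be sketched in Python. This is only an illustration: the counter values are invented, and the optional second argument assumes you ran a separate pass with the corresponding subp1 counters (names assumed by analogy with subp0; confirm them with nvprof --query-events). Counts taken from two different runs may not match exactly, so treat a combined figure as an approximation.

```python
def l2_l1_read_hit_rate(subp0, subp1=None):
    """Compute hit sectors / sector queries.

    subp0 and subp1 are (read_hit_sectors, read_sector_queries) tuples.
    subp1 is optional and would come from a second profiling pass.
    Returns None when no read queries were recorded.
    """
    hits, queries = subp0
    if subp1 is not None:
        # Sum both sub-partitions before dividing, rather than averaging two ratios.
        hits += subp1[0]
        queries += subp1[1]
    if queries == 0:
        return None
    return hits / queries


# Hypothetical counter values for one kernel.
print(l2_l1_read_hit_rate((450, 500)))              # subp0 only
print(l2_l1_read_hit_rate((450, 500), (430, 500)))  # subp0 + subp1 from two passes
```

Summing hits and queries before dividing weights each sub-partition by its traffic, which matters only if the two sub-partitions see noticeably different numbers of queries.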