Can't profile L1 and L2 hit ratios on K40 and Titan Z

I tried to profile the L1 and L2 cache hit ratios on K40 and Titan Z cards with the following command.

nvprof --metrics l1_cache_global_hit_rate ./vecadd

vecadd is just a simple vector addition CUDA program. Though I’m sure the kernel finished successfully, the reported min, max, and avg for the metrics l1_cache_global_hit_rate and l2_cache_global_hit_rate are all 0.00%. Does that mean K40 and Titan Z do not support profiling L1 and L2 cache hit ratios?

Well, it looks like your command only asks for the L1 cache hit rate, right?

K40 is a Kepler device. By default it has L1 turned off for ordinary global load caching.

Titan Z is also a Kepler device. It also has L1 turned off by default for the same scenario.

Both points are covered in the documentation. Take a look at the Kepler Tuning Guide, for example.

Regarding L2, for a very simple program (say, that reads a vector exactly once) it’s possible that it never hits in L2.

K40 and Titan Z are both based on GK110B. You should be able to enable L1 caching of global loads on both chips. See the Programming Guide in the CUDA Toolkit Documentation.

In the vector add sample, each warp reads 32 consecutive 32-bit values from unique addresses. No address is read or written more than once, so, as expected, the cache hit rate is 0%. If you change the sample so that for one of the vectors every thread reads from B[0] (the same address for all threads), you should see a read hit rate of ~50%, because all A accesses miss and all B accesses hit (except for the first access on every SM). If you see this in L2 but not in L1, follow the directions in the link above to try to enable L1 caching. You may also have to look at the assembly code (SASS, not PTX), as the compiler will likely access the A and B vectors using the LDG instruction, which goes through the texture cache rather than the L1 cache.
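
A minimal sketch of that experiment, in case it helps. The kernel name, file name, and vector size are made up for illustration and are not from the original vecadd. The pointers are deliberately left without __restrict__, on the assumption that this makes the compiler less likely to emit LDG; that needs to be confirmed in the SASS. The build flag -Xptxas -dlcm=ca is the opt-in for L1 caching of global loads described in the Kepler Tuning Guide.

```cuda
// Build (assumes an sm_35 part such as K40; adjust -arch as needed):
//   nvcc -arch=sm_35 -Xptxas -dlcm=ca vecadd_b0.cu -o vecadd_b0
// Profile:
//   nvprof --metrics l1_cache_global_hit_rate ./vecadd_b0
// Inspect the SASS for LD vs. LDG:
//   cuobjdump -sass vecadd_b0
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of vecadd: every thread reads b[0] instead of b[i],
// so loads of b can hit in cache after the first access, while loads of a
// still touch each address exactly once.
__global__ void vecAddBroadcastB(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[0];
}

int main()
{
    const int n = 1 << 20;                  // arbitrary size for the test
    const size_t bytes = n * sizeof(float);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);                // contents do not matter for hit rates
    cudaMemset(b, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAddBroadcastB<<<blocks, threads>>>(a, b, c, n);

    // Report launch or execution errors so a 0% hit rate is not caused by a failed kernel.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err == cudaSuccess ? 0 : 1;
}
```

Building once with -Xptxas -dlcm=ca and once without it, then comparing the L1 metric between the two runs, should show whether the opt-in took effect.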