vecadd is just a simple CUDA vector addition program. Although I’m sure the kernel finished successfully, the min, max, and avg values reported for the metrics l1_cache_global_hit_rate and l2_cache_global_hit_rate are all 0.00%. Does that mean the K40 and Titan Z do not support profiling L1 and L2 cache hit rates?
In the vector add sample, each warp reads 32 consecutive 32-bit values from unique addresses. No address is read or written more than once, so a cache hit rate of 0% is expected. If you change the sample so that, for one of the vectors, every thread reads from B[0] (the same address for all threads), then you should see a read hit rate of ~50%, because all A accesses miss and all B accesses hit (except for the first access on every SM). If you see this in L2 but not in L1, follow the directions in the link above to try to enable L1 caching. You may also have to look at the assembly code (SASS, not PTX), as the compiler will likely access the A and B vectors using the LDG instruction, which goes through the texture cache rather than the L1 cache.
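To make the suggested experiment concrete, here is a minimal sketch of the two access patterns side by side. The kernel names `vecAdd` and `vecAddBroadcastB`, the helper `runVecAdd`, and the problem size are my own placeholders, not taken from the original vecadd sample, so adapt them to your code. The only change that matters is `B[i]` becoming `B[0]`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Original pattern: every thread reads its own element of A and B,
// so no address is touched twice.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

// Modified pattern: every thread reads B[0], so the B loads reuse
// a single address while the A loads stay unique.
__global__ void vecAddBroadcastB(const float *A, const float *B, float *C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[0];
}

// Hypothetical helper: runs one of the two kernels on n elements and
// returns the number of elements that differ from the host reference.
int runVecAdd(bool broadcastB, int n)
{
    std::vector<float> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) {
        // Small integer values, so the float sums are exact.
        hA[i] = static_cast<float>(i % 1000);
        hB[i] = static_cast<float>(2 * (i % 1000) + 1);
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dA, bytes));
    CUDA_CHECK(cudaMalloc((void **)&dB, bytes));
    CUDA_CHECK(cudaMalloc((void **)&dC, bytes));
    CUDA_CHECK(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (broadcastB)
        vecAddBroadcastB<<<blocks, threads>>>(dA, dB, dC, n);
    else
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        const float expected = hA[i] + (broadcastB ? hB[0] : hB[i]);
        if (hC[i] != expected)
            ++mismatches;
    }

    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return mismatches;
}

int main()
{
    const int n = 1 << 20;  // arbitrary size chosen for illustration
    std::printf("vecAdd mismatches:           %d\n", runVecAdd(false, n));
    std::printf("vecAddBroadcastB mismatches: %d\n", runVecAdd(true, n));
    return 0;
}
```

Build it with nvcc and profile it with the same two metrics as in the question; the profiler should report the two kernels separately, so you can compare their hit rates directly. To check which load instruction the compiler actually emitted, dump the SASS with `cuobjdump -sass` on the resulting binary and look for LDG in the kernel bodies.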