L2 cache on A100 reports a 179% hit rate!

Hello everyone, I’m building a model of the A100 GPU, and to do that I need to demystify its caches.
While doing that, I found that the L2 cache sometimes (more than once) reports a hit rate above 100%,
for example 179%, 130% and 102%.
The benchmark I’m running is the polybench->linear_algebra->gramschmidt app:
gramschmidt_kernel3(int, int, float*, float*, float*, int), 2022-Dec-14 23:30:37, Context 1, Stream 7
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Mbyte/second                          34.48
Mem Busy                                                                             %                           0.67
Max Bandwidth                                                                        %                           0.42
L1/TEX Hit Rate                                                                      %                              0
L2 Compression Success Rate                                                          %                              0
L2 Compression Ratio                                                                                                0
L2 Hit Rate                                                                          %                         179.38
Mem Pipes Busy                                                                       %                           0.01
---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             32
Block Limit Registers                                                            block                              8
Block Limit Shared Mem                                                           block                            164
Block Limit Warps                                                                block                              8
Theoretical Active Warps per SM                                                   warp                             64
Theoretical Occupancy                                                                %                            100
Achieved Occupancy                                                                   %                          12.36
Achieved Active Warps Per SM                                                      warp                           7.91
---------------------------------------------------------------------- --------------- ------------------------------

This can be an artifact of the profiler, because it is doing a kind of GPU sampling and then scaling that measurement across the entire GPU. It’s impossible to say if that is the case with the limited info you have provided. (In my experience this kind of artifact can arise when the GPU is not at full occupancy, which appears to be the case in your output.)
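To make the "sample, then scale" idea concrete, here is a toy Python model. It is only an illustration under stated assumptions, not how Nsight Compute actually computes the metric: the function name, the notion of "slices", and all the numbers are hypothetical. It assumes hits are counted on a subset of units and scaled up to the whole GPU, while total lookups are counted exactly. With evenly spread traffic the estimate is sensible, but when most traffic lands on the sampled unit the scaled hit count exceeds the real number of lookups.

```python
def scaled_hit_rate(hits_per_slice, lookups_per_slice, sampled_slices):
    """Toy estimate: count hits on the sampled slices only, scale that count
    to all slices, then divide by the exact total number of lookups."""
    num_slices = len(lookups_per_slice)
    sampled_hits = sum(hits_per_slice[i] for i in sampled_slices)
    # Scale the sampled count as if every slice saw the same traffic.
    estimated_hits = sampled_hits * num_slices / len(sampled_slices)
    total_lookups = sum(lookups_per_slice)
    return 100.0 * estimated_hits / total_lookups


# Hypothetical numbers: 4 slices, 90% true hit rate on every slice.
balanced = scaled_hit_rate([90, 90, 90, 90], [100, 100, 100, 100], [0])
skewed = scaled_hit_rate([90, 9, 9, 9], [100, 10, 10, 10], [0])

print(f"balanced traffic: {balanced:.1f}%")
print(f"skewed traffic:   {skewed:.1f}%")
```

In the balanced case the estimate matches the true 90%. In the skewed case it lands well above 100%, even though the true hit rate is still 90% on every slice.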

If you’re asking “generally” under what circumstances the profiler could report a higher than 100% hit rate in L2, I suggest asking that question on the nsight compute forum.

Thanks for your answer.
What info can I add to make it clearer?

A short, complete test case, and the full output from your ncu cli session (not just the memory workload and occupancy sections.)

Also see here. That is the most likely cause. If your test case shows that your kernel launch does not saturate the GPU, then the response will be the same: increase the GPU workload until it does. (Or just ignore the L2 cache hit rate number.)

And if this devolves into a profiler behavior discussion, I will direct you to the profiler forums, as already indicated.

Ok, thanks.
I have attached the full report with all metrics from ncu.

GPU: A100
Driver Version: 515.48.07
CUDA Version: 11.3.1
This is the benchmark:

To be specific, the app path is: main/Benchmarks/PolyBench/linear-algebra/gramschmidt

I’m using 108 blocks (1 block per SM)
and 256 threads per block.

final1.txt (81.9 MB)

Then I suggest increasing the number of threads until each SM has its maximum complement.
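As a rough check, the launch configuration in this thread can be compared against the occupancy section with some quick arithmetic. This is a minimal Python sketch, assuming 32 threads per warp and that exactly one block is resident on each SM. The constants are taken from the posts and the ncu output above, and the variable names are only illustrative.

```python
WARP_SIZE = 32              # threads per warp (assumed, standard for NVIDIA GPUs)
NUM_SMS = 108               # from the thread: 108 blocks, 1 block per SM
MAX_WARPS_PER_SM = 64       # "Theoretical Active Warps per SM" in the report
THREADS_PER_BLOCK = 256     # from the thread

# Warps contributed by one block.
warps_per_block = THREADS_PER_BLOCK // WARP_SIZE

# Occupancy if each SM holds exactly one such block.
occupancy_one_block = 100.0 * warps_per_block / MAX_WARPS_PER_SM

# Blocks needed per SM, and in total, to reach the warp limit.
blocks_per_sm_full = MAX_WARPS_PER_SM // warps_per_block
blocks_for_full_gpu = blocks_per_sm_full * NUM_SMS

print(f"warps per block:            {warps_per_block}")
print(f"occupancy with 1 block/SM:  {occupancy_one_block}%")
print(f"blocks per SM for 64 warps: {blocks_per_sm_full}")
print(f"blocks to fill all SMs:     {blocks_for_full_gpu}")
```

With one 256-thread block per SM this gives 8 warps per SM, or 12.5%, which is close to the reported Achieved Occupancy of 12.36% (7.91 warps). Reaching the full 64 warps per SM would take 8 such blocks per SM, which matches the Block Limit Warps and Block Limit Registers values of 8 in the report.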

I did that, and the problem still exists.

It is probably best to ask about it on the profiler forum that I already linked. Another suggestion would be to update to the latest CUDA version and latest profiler version and retest.