L2 cache in A100 provides 179% hit rate!

user40368 · December 15, 2022, 6:52am

Hello everyone, I’m building a model for the A100 GPU, and to do that I need to demystify the caches.
While doing that, I found that sometimes (more than once) the L2 cache reports a hit rate above 100%,
for example 179%, 130%, and 102%.
The benchmark that I’m running is the polybench->linear_algebra->gramschmidt app.
gramschmidt_kernel3(int, int, float*, float*, float*, int), 2022-Dec-14 23:30:37, Context 1, Stream 7
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Mbyte/second                          34.48
Mem Busy                                                                             %                           0.67
Max Bandwidth                                                                        %                           0.42
L1/TEX Hit Rate                                                                      %                              0
L2 Compression Success Rate                                                          %                              0
L2 Compression Ratio                                                                                                0
L2 Hit Rate                                                                          %                         179.38
Mem Pipes Busy                                                                       %                           0.01
---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             32
Block Limit Registers                                                            block                              8
Block Limit Shared Mem                                                           block                            164
Block Limit Warps                                                                block                              8
Theoretical Active Warps per SM                                                   warp                             64
Theoretical Occupancy                                                                %                            100
Achieved Occupancy                                                                   %                          12.36
Achieved Active Warps Per SM                                                      warp                           7.91
---------------------------------------------------------------------- --------------- ------------------------------
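As a quick sanity check on the Occupancy section above, the achieved occupancy should simply be the achieved active warps divided by the theoretical maximum. A minimal Python sketch, with the numbers copied from the report (the variable names are only illustrative):

```python
# Values copied from the Occupancy section of the ncu report above.
theoretical_active_warps_per_sm = 64
achieved_active_warps_per_sm = 7.91

# Achieved occupancy = achieved active warps / theoretical active warps.
achieved_occupancy_pct = (
    100.0 * achieved_active_warps_per_sm / theoretical_active_warps_per_sm
)
print(f"{achieved_occupancy_pct:.2f}%")  # 12.36%
```

So the report is internally consistent: on average only about 8 of the 64 possible warps are resident on each SM.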

Robert_Crovella · December 15, 2022, 3:37pm

This can be an artifact of the profiler, because it is doing a kind of GPU sampling and then scaling that measurement across the entire GPU. It’s impossible to say if that is the case with the limited info you have provided. (In my experience this kind of artifact can arise when the GPU is not at full occupancy, which appears to be the case in your output.)

If you’re asking “generally” under what circumstances the profiler could report a higher than 100% hit rate in L2, I suggest asking that question on the nsight compute forum.

user40368 · December 16, 2022, 12:40am

Thanks for your answer.
What info can I add to make it clearer?

Robert_Crovella · December 16, 2022, 12:54am

A short, complete test case, and the full output from your ncu cli session (not just the memory workload and occupancy sections.)

Also see here. That is the most likely cause. If it’s observed from your test case that your kernel launch does not saturate the GPU, then the response will be the same: increase the GPU workload to saturate the GPU. (Or just ignore the L2 cache hit rate number.)

And if this devolves into a profiler behavior discussion, I will direct you to the profiler forums, as already indicated.

user40368 · December 19, 2022, 8:43am

OK, thanks.
I attached the full report with all metrics from ncu.

GPU: A100
Driver Version: 515.48.07
CUDA Version: 11.3.1
This is the benchmark:

To be specific, the app path is: main/Benchmarks/PolyBench/linear-algebra/gramschmidt

I’m using 108 blocks (1 block per SM)
and 256 threads per block.
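For reference, this launch configuration by itself caps occupancy well below 100%. A rough Python sketch of the arithmetic, assuming a warp size of 32, the 64-warp limit per SM shown in the Occupancy section, and that each of the 108 blocks really lands on its own SM (the names are illustrative):

```python
WARP_SIZE = 32         # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 64  # "Theoretical Active Warps per SM" in the report

threads_per_block = 256  # from the launch configuration above
blocks_per_sm = 1        # assumption: one block resident on each SM

# Warps resident on each SM under this launch configuration.
warps_per_sm = blocks_per_sm * threads_per_block // WARP_SIZE
occupancy_ceiling_pct = 100.0 * warps_per_sm / MAX_WARPS_PER_SM
print(warps_per_sm, occupancy_ceiling_pct)  # 8 12.5
```

A ceiling of 12.5% is consistent with the 12.36% achieved occupancy in the profiler output posted earlier.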

final1.txt (81.9 MB)

Robert_Crovella · December 19, 2022, 2:38pm

Then I suggest increasing the number of threads until there are the maximum complement per SM.
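As a rough guide to what "maximum complement" means here, the sketch below estimates the grid size needed to fill every SM with 256-thread blocks. It assumes a warp size of 32, the 64-warp limit per SM from the Occupancy section, and 108 SMs as stated above. Whether the benchmark's problem size can actually be split into that many blocks is a separate question.

```python
WARP_SIZE = 32         # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 64  # "Theoretical Active Warps per SM" in the report
NUM_SMS = 108          # SM count stated in the thread
threads_per_block = 256

max_threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE              # 2048
blocks_per_sm_for_full = max_threads_per_sm // threads_per_block  # 8
grid_blocks_for_full = blocks_per_sm_for_full * NUM_SMS        # 864

print(blocks_per_sm_for_full, grid_blocks_for_full)  # 8 864
```

The figure of 8 blocks per SM matches the "Block Limit Warps" and "Block Limit Registers" rows of the report, so roughly 864 blocks of 256 threads would be needed instead of 108.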

user40368 · December 24, 2022, 11:12pm

I did that, and the problem still exists.

Robert_Crovella · December 25, 2022, 3:17pm

It is probably best to ask about this on the profiler forum that I already linked. Another suggestion would be to update to the latest CUDA version and the latest profiler version, then retest.
