Hello everyone, I’m building a model for the A100 GPU, and to do that I need to demystify the caches.
While doing that, I found that sometimes (more than once) the L2 cache reports a hit rate above 100%.
For example, it reported 179%, 130%, and 102%.
The project details are as follows:
GPU: A100
Driver Version: 515.48.07
CUDA Version: 11.3.1
This is the benchmark:
To be specific, the app path is: main/Benchmarks/PolyBench/linear-algebra/gramschmidt
I’m using 108 blocks (1 block per SM)
and 256 threads per block.
I have also attached the full report of the Nsight Compute metrics:
final1.txt (81.9 MB)
ramschmidt_kernel3(int, int, float*, float*, float*, int), 2022-Dec-14 23:30:37, Context 1, Stream 7
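Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Mbyte/second                          34.48
Mem Busy                                                                             %                           0.67
Max Bandwidth                                                                        %                           0.42
L1/TEX Hit Rate                                                                      %                              0
L2 Compression Success Rate                                                          %                              0
L2 Compression Ratio                                                                                                0
L2 Hit Rate                                                                          %                         179.38
Mem Pipes Busy                                                                       %                           0.01
---------------------------------------------------------------------- --------------- ------------------------------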
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             32
Block Limit Registers                                                            block                              8
Block Limit Shared Mem                                                           block                            164
Block Limit Warps                                                                block                              8
Theoretical Active Warps per SM                                                   warp                             64
Theoretical Occupancy                                                                %                            100
Achieved Occupancy                                                                   %                          12.36
Achieved Active Warps Per SM                                                      warp                           7.91
---------------------------------------------------------------------- --------------- ------------------------------
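
The gap between the theoretical occupancy (100%) and the achieved occupancy (12.36%) in the Occupancy section follows from the launch configuration. Below is a minimal sketch of that arithmetic, assuming the 108 blocks are spread evenly with one block resident per SM (as stated in the post) and a warp size of 32 threads. The variable names are illustrative only.

```python
# Values copied from the post and from the Occupancy section of the report.
threads_per_block = 256
blocks_launched = 108
sm_count = 108            # "1 block per SM" in the post
max_warps_per_sm = 64     # "Theoretical Active Warps per SM"

# Assumption: 32 threads per warp.
warp_size = 32

# Each block contributes 256 / 32 = 8 warps.
warps_per_block = threads_per_block // warp_size

# Assumption: blocks are distributed evenly, so each SM holds one block.
blocks_per_sm = blocks_launched // sm_count

# Warps actually resident on each SM with this grid size.
resident_warps_per_sm = warps_per_block * blocks_per_sm

# Upper bound on occupancy imposed by the launch configuration itself.
launch_bound_occupancy_pct = 100 * resident_warps_per_sm / max_warps_per_sm

print(f"{resident_warps_per_sm} warps per SM -> {launch_bound_occupancy_pct}% occupancy")
```

With 8 of a possible 64 warps resident, the bound is 12.5%, which is close to the measured 12.36% (7.91 warps). Under these assumptions, the low achieved occupancy comes from the small grid rather than from a register or shared-memory limit.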