Get a 100% L1 Cache Hit Rate

Hello! I’m still learning about CUDA and GPUs. I have been checking the forums, but it isn’t clear to me why I’m obtaining these cache hit rate percentages. In the following example code I read the same value from global memory in every thread and store it in shared memory, so the original value is never modified.

I’m using a GTX 980 Ti.

#include <stdio.h>
const int BLOCKS = 100;
const int WARPS_BLOCK = 16;
const int THREADS_WARP = 32;
const int THREADS_BLOCK = WARPS_BLOCK * THREADS_WARP;
__global__ void test(int* dummy) {
    __shared__ int dummy_shared[THREADS_BLOCK];
    dummy_shared[threadIdx.x] = dummy[0];
}
int main() {
    int* dummy;
    cudaMalloc((void**)&dummy, sizeof(int));
    cudaMemset(dummy, 0, sizeof(int));

    test<<<BLOCKS,THREADS_BLOCK>>>(dummy);

    cudaDeviceSynchronize();
    cudaFree(dummy);
    return 0;
}

nvprof shows that the L2 hit rate for reads is 100%, as I was expecting. Also, since I’m not using local memory, I’m expecting the Local Hit Rate to be 0%.

           L2 Hit Rate (Texture Reads)     100.00%     100.00%     100.00%
          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
                        Local Hit Rate       0.00%       0.00%       0.00%

My issue resides in the L1/tex cache. I was expecting a 100% hit rate, but I’m not able to get past 75%. At first I was writing to and reading from the same global address, but I suspected that different SMs writing to that same address would force the L1 caches of the other SMs to mark their data as invalid/stale and request it again. However, changing the writes to shared memory did not make any improvement.

Using shared memory this way, maybe warps could still write to the same address and slow down performance, but that is a separate problem which, as I understand it, should not affect the cache hit rate.

     Global Hit Rate in unified l1/tex      75.00%      75.00%      75.00%
              Unified Cache Throughput  39.662GB/s  39.662GB/s  39.662GB/s
            Unified Cache Transactions        6400        6400        6400
             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
                Unified Cache Hit Rate      75.00%      75.00%      75.00%

I would also like to ask: what is the unified cache in this context? The concept of unified memory, as I understood it, is when the host and the device share the same address space, but since I’m using cudaMalloc, I’m not using that concept. Is the Unified Cache here referring to that unified memory, or is it something different?

I also checked the disassembled SASS code to confirm that the compiler isn’t changing the kernel’s behavior:

        code for sm_30
                Function : _Z4testPi
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                           /* 0x2203f28042804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];   /* 0x2800400110005de4 */
        /*0010*/                   MOV R2, c[0x0][0x140];  /* 0x2800400500009de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];  /* 0x280040051000dde4 */
        /*0020*/                   LD.E R2, [R2];          /* 0x8400000000209c85 */
        /*0028*/                   S2R R4, SR_TID.X;       /* 0x2c00000084011c04 */
        /*0030*/                   SHL R4, R4, 0x2;        /* 0x6000c00008411c03 */
        /*0038*/                   STS [R4], R2;           /* 0xc900000000409c85 */
                                                           /* 0x20000000000002f7 */
        /*0048*/                   EXIT;                   /* 0x8000000000001de7 */
        /*0050*/                   BRA 0x50;               /* 0x4003ffffe0001de7 */
        /*0058*/                   NOP;                    /* 0x4000000000001de4 */
        /*0060*/                   NOP;                    /* 0x4000000000001de4 */
        /*0068*/                   NOP;                    /* 0x4000000000001de4 */
        /*0070*/                   NOP;                    /* 0x4000000000001de4 */
        /*0078*/                   NOP;                    /* 0x4000000000001de4 */
                ........................

In summary: why is the L1 hit rate not 100%? What would a very simple example kernel with a 100% L1 hit rate look like?

Thank you for your time.

The GTX 980 Ti is compute capability 5.2 (GM20x). The L1TEX cache does not support caching global memory accesses unless the data is read-only for the lifetime of the kernel. The CUDA Programming Guide section on Compute Capability 5.x Global Memory provides details on how to accomplish this.

Why is the L1 cache hit rate 75%?

The Compute Capability 5.x and 6.x L1TEX cache processes requests in order. A 32-bit global load instruction is sent to the L1TEX as 4 requests of 8 threads each. The current code does not identify dummy[0] as read-only, so the value cannot be cached beyond the current warp. The first request misses; due to the guaranteed in-order processing, the remaining 3 requests hit (covered misses), so the hit rate is 75%.

What will the L1 cache hit rate be if dummy is marked read-only?

If you mark dummy as const __restrict__, then the data should be cached. The GTX 980 Ti has 22 SMs, and each SM has 2 independent L1TEX caches. The code launches 100 blocks × 16 warps/block.
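A minimal sketch of the read-only version (the kernel from the question with only the parameter qualified; whether the compiler actually emits the read-only load should be verified in the SASS):

```cuda
const int THREADS_BLOCK = 16 * 32;  // 16 warps/block, as in the original code

__global__ void test(const int* __restrict__ dummy) {
    __shared__ int dummy_shared[THREADS_BLOCK];
    // With the pointer qualified const __restrict__, the compiler is free to
    // route the load through the read-only data path (LDG), so the value can
    // stay resident in the unified L1/tex cache across warps.
    dummy_shared[threadIdx.x] = dummy[0];
    // Equivalent explicit form on cc 3.5+:
    //     dummy_shared[threadIdx.x] = __ldg(&dummy[0]);
}
```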

average_warps_per_l1tex = (100 × 16) / (22 × 2) ≈ 36

Each warp will execute 1 global load which is 4 requests to the L1TEX. 1 request per cache will miss.

l1tex_hit_rate = (36 × 4 − 1) / (36 × 4) × 100 ≈ 99.3%
