Get a 100% L1 Cache Hit Rate

Hello! I’m still learning about CUDA and GPUs. I have been checking the forums, but it isn’t clear to me why I’m obtaining these cache hit rate percentages. In the following example code I read the same value from global memory in every thread and store it in shared memory, so the original value is never modified.

I’m using a GTX 980 Ti.

#include <stdio.h>
const int BLOCKS = 100;
const int WARPS_BLOCK = 16;
const int THREADS_WARP = 32;
const int THREADS_BLOCK = WARPS_BLOCK * THREADS_WARP;
__global__ void test(int* dummy) {
    __shared__ int dummy_shared[THREADS_BLOCK];
    dummy_shared[threadIdx.x] = dummy[0];
}
int main() {
    int* dummy;
    cudaMalloc((void**)&dummy, sizeof(int));
    cudaMemset(dummy, 0, sizeof(int));

    test<<<BLOCKS,THREADS_BLOCK>>>(dummy);

    cudaDeviceSynchronize();
    cudaFree(dummy);
    return 0;
}

nvprof shows that the L2 hit rate for reads is 100%, as I was expecting. Also, since I’m not using local memory, I’m expecting the Local Hit Rate to be 0%.

           L2 Hit Rate (Texture Reads)     100.00%     100.00%     100.00%
          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
                        Local Hit Rate       0.00%       0.00%       0.00%

My issue resides in the L1/tex cache. I was expecting a 100% hit rate, but I’m not able to get past 75%. At first I was writing to and reading from the same global address, but I suspected that different SMs writing to that same address would force the L1 caches of the other SMs to mark their data as invalid/stale and request it again. However, changing the writes to shared memory did not make any improvement.

Using shared memory this way, maybe warps could still write to the same address and slow down performance, but that is a separate problem which, as I understand it, should not affect the cache hit rate.

     Global Hit Rate in unified l1/tex      75.00%      75.00%      75.00%
              Unified Cache Throughput  39.662GB/s  39.662GB/s  39.662GB/s
            Unified Cache Transactions        6400        6400        6400
             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
                Unified Cache Hit Rate      75.00%      75.00%      75.00%

I would also like to ask: what is the unified cache in this context? The concept of unified memory, as I understood it, is when the host and the device share the same address space, but since I’m using cudaMalloc, I’m not using that concept. Is the Unified Cache here referring to that unified memory, or is it something different?

I also checked the disassembled SASS code to confirm that the compiler isn’t changing the kernel’s behavior:

        code for sm_30
                Function : _Z4testPi
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                           /* 0x2203f28042804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];   /* 0x2800400110005de4 */
        /*0010*/                   MOV R2, c[0x0][0x140];  /* 0x2800400500009de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];  /* 0x280040051000dde4 */
        /*0020*/                   LD.E R2, [R2];          /* 0x8400000000209c85 */
        /*0028*/                   S2R R4, SR_TID.X;       /* 0x2c00000084011c04 */
        /*0030*/                   SHL R4, R4, 0x2;        /* 0x6000c00008411c03 */
        /*0038*/                   STS [R4], R2;           /* 0xc900000000409c85 */
                                                           /* 0x20000000000002f7 */
        /*0048*/                   EXIT;                   /* 0x8000000000001de7 */
        /*0050*/                   BRA 0x50;               /* 0x4003ffffe0001de7 */
        /*0058*/                   NOP;                    /* 0x4000000000001de4 */
        /*0060*/                   NOP;                    /* 0x4000000000001de4 */
        /*0068*/                   NOP;                    /* 0x4000000000001de4 */
        /*0070*/                   NOP;                    /* 0x4000000000001de4 */
        /*0078*/                   NOP;                    /* 0x4000000000001de4 */
                ........................

In summary: why is the L1 hit rate not 100%? What would a very simple example kernel with a 100% L1 hit rate look like?

Thank you for your time.

The GTX 980 Ti is compute capability 5.2 (GM20x). The L1TEX cache does not support caching global memory accesses unless the data is read-only for the lifetime of the kernel. The CUDA Programming Guide section on Compute Capability 5.x Global Memory provides details on how to accomplish this.

Why is the L1 cache hit rate 75%?

The Compute Capability 5.x and 6.x L1TEX cache processes requests in order. A 32-bit global load instruction is sent to the L1TEX as 4 requests of 8 threads each. The current code does not identify dummy[0] as read-only, so the value cannot be cached beyond the current warp. The first request misses; due to the guaranteed in-order processing, the remaining 3 requests hit (covered misses), so the hit rate is 75%.

What will the L1 cache hit rate be if dummy is marked read-only?

If you mark dummy as const __restrict__, then the data should be cached. The GTX 980 Ti has 22 SMs, and each SM has 2 independent L1TEX caches. The code launches 100 blocks × 16 warps/block.
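A minimal sketch of the read-only version (the kernel from the question with only the parameter qualified; whether the compiler actually emits the read-only load should be verified in the SASS):

```cuda
const int THREADS_BLOCK = 16 * 32;  // 16 warps/block, as in the original code

__global__ void test(const int* __restrict__ dummy) {
    __shared__ int dummy_shared[THREADS_BLOCK];
    // With the pointer qualified const __restrict__, the compiler is free to
    // route the load through the read-only data path (LDG), so the value can
    // stay resident in the unified L1/tex cache across warps.
    dummy_shared[threadIdx.x] = dummy[0];
    // Equivalent explicit form on cc 3.5+:
    //     dummy_shared[threadIdx.x] = __ldg(&dummy[0]);
}
```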

average_warps_per_l1tex = (100 × 16) / (22 × 2) ≈ 36

Each warp will execute 1 global load which is 4 requests to the L1TEX. 1 request per cache will miss.

l1tex_hit_rate = (36 × 4 − 1) / (36 × 4) × 100 ≈ 99.3%
