L1/L2 cache profiling on Jetson Nano

Hello,

I am trying to understand the behaviour of the L1/L2 caches on the Jetson Nano (Maxwell architecture).

According to the Maxwell documentation:
Kernel memory requests are typically served between the device DRAM and SM on-chip memory using either 128-byte or 32-byte memory transactions.
If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction.

To check this, I compiled my program once with -Xptxas -dlcm=cg and once with -Xptxas -dlcm=ca.
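Roughly, the two builds look like this (the _cg/_ca output names are only for illustration; the other flags match the full command given further below):

nvcc -O3 -gencode arch=compute_53,code=sm_53 -Xptxas -dlcm=cg L1_L2Cache.cu -o L1_L2Cache_cg
nvcc -O3 -gencode arch=compute_53,code=sm_53 -Xptxas -dlcm=ca L1_L2Cache.cu -o L1_L2Cache_ca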

__global__ void readOffset(float*  A, const int n,
                           int offset)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    unsigned int k = i + offset;                             // offset read index
    __shared__ float saveVal[512];                           // one slot per thread in the block
    //if(k < n) 
    saveVal[threadIdx.x] = A[k]; // +B[k];                   // global load stored to shared memory
}

void L1_L2Cache(int nOffset)
{
    int N =  1024*32;
    float *A;

    cudaMallocManaged((void**)&A, N * sizeof(float));
   

    dim3 block(512);
    dim3 grid((N + block.x - 1) / block.x);

    float time;
    cudaEvent_t tstart, tstop;
    (cudaEventCreate(&tstart));
    (cudaEventCreate(&tstop));

    (cudaEventRecord(tstart, 0));

    readOffset << <grid,block >> > (A,N,nOffset);

    cudaDeviceSynchronize();

    (cudaEventRecord(tstop, 0));
    (cudaEventSynchronize(tstop));

    (cudaEventElapsedTime(&time, tstart, tstop));

    printf("L1/L2 cache Time :  %3.5f ms \n", time);
    cudaFree(A); 
}

int main() 
{
    L1_L2Cache(0);
    return 0;
}

Using nvprof, I see the following metrics for both compilation options:

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "NVIDIA Tegra X1 (0)"
    Kernel: readOffset(float*, int, int)
          1                      l2_global_load_bytes   Bytes read from L2 for misses in Unified Cache for global loads      131072      131072      131072
          1                    tex_cache_transactions                                        Unified Cache Transactions        4096        4096        4096
          1                      global_load_requests          Total number of global load requests from Multiprocessor        4096        4096        4096
          1              gld_transactions_per_request                              Global Load Transactions Per Request   16.001953   16.001953   16.001953
          1                          gld_transactions                                          Global Load Transactions       16386       16386       16386
          1                             ldst_executed                                  Executed Load/Store Instructions        4096        4096        4096
          1                      l2_read_transactions                                              L2 Read Transactions        4196        4196        4196
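
For reference, an nvprof invocation along these lines collects exactly this metric set (executable name as built below):

nvprof --metrics l2_global_load_bytes,tex_cache_transactions,global_load_requests,gld_transactions_per_request,gld_transactions,ldst_executed,l2_read_transactions ./L1_L2Cache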

To my understanding:
Accessing the 1024x32 array elements requires 1024 warp-level load instructions (= requests) of 32 threads each. Each thread reads 4 bytes, so each warp reads 32 x 4 = 128 bytes. With L1 enabled that should be 1 memory transaction per warp; with L1 disabled it should be 4 (32-byte) transactions per warp, i.e. 1024 x 4 = 4096, which is what I see in Executed Load/Store Instructions.
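
A tiny plain-C check of that expectation (my own arithmetic, not profiler output):

/* Back-of-envelope check of the expectation above. */
#include <stdio.h>

int main(void)
{
    const int N = 1024 * 32;           /* elements in A (floats)          */
    const int warps = N / 32;          /* 1 global load request per warp  */
    printf("global load requests      : %d\n", warps);               /* 1024 */
    printf("bytes loaded per warp     : %d\n", 32 * 4);              /* 128  */
    printf("transactions, L1 on  (ca) : %d (1 x 128B per warp)\n", warps * 1);
    printf("transactions, L1 off (cg) : %d (4 x 32B  per warp)\n", warps * 4);
    return 0;
}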

Can someone please explain these numbers? Thank you.

The L1 and L2 caches both have 128B cache lines consisting of 4 x 32B sectors.

** WARNING conflicting naming conventions to follow **

Global load instructions to the Maxwell and Pascal unified L1 cache are broken into requests of 8 threads. The threads in a request are not N consecutive lanes: the thread groups map to the texture cache design, which processes quads (groups of 4 threads). This results in non-intuitive PM counter values for local and global accesses.

If a warp instruction accesses a 128-byte range that is 128-byte aligned (maps to 1 cache line), and each thread accesses a consecutive 32-bit value in lane ID order, then there will be a total of 8 sector queries. 4 of the sector queries will miss and 4 will hit (really hit under miss). The sectors are counted as hits because the initial miss already allocated the miss request. The Maxwell/Pascal unified cache is in order, so any sector access after a miss is a hit (annoying and not intuitive). If caching is disabled, the hit/miss behaviour is still true as it pertains to a single warp instruction; the hit does not benefit any other instruction.
The total number of requests to L2 will be 1 request of 4 sectors for each warp instruction.
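
To make this counting concrete, here is a small standalone sketch (plain C, not an NVIDIA tool). The lane-to-request mapping used below (request r = quads r and r+4) is only an assumption for illustration; the point is the resulting totals of 8 sector queries, 4 misses and 4 hits under miss for one aligned, fully coalesced 128-byte warp load.

#include <stdio.h>

int main(void)
{
    int line_pending[4] = {0, 0, 0, 0};   /* 32B sectors of the 128B line already requested */
    int queries = 0, misses = 0, hits_under_miss = 0;

    for (int r = 0; r < 4; ++r) {         /* 4 requests of 8 threads per warp instruction */
        int req_sector[4] = {0, 0, 0, 0}; /* sectors touched by this request */
        for (int t = 0; t < 8; ++t) {
            /* assumed mapping: request r = quad r plus quad r+4 */
            int lane = (t < 4) ? 4 * r + t : 16 + 4 * r + (t - 4);
            int byte = lane * 4;          /* coalesced: lane i reads bytes [4i, 4i+4) */
            req_sector[byte / 32] = 1;
        }
        for (int s = 0; s < 4; ++s) {
            if (!req_sector[s]) continue;
            ++queries;                    /* one sector query per distinct sector per request */
            if (line_pending[s]) ++hits_under_miss;  /* earlier request already missed on it */
            else { line_pending[s] = 1; ++misses; }
        }
    }
    printf("sector queries %d, misses %d, hits under miss %d\n",
           queries, misses, hits_under_miss);        /* prints 8, 4, 4 */
    return 0;
}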

** NVPROF naming conventions **
NVPROF uses a different nomenclature than the above (and different from other performance tools), as nvprof was trying to stay consistent with the Kepler architecture, which used the terms request = instruction and transaction = interaction with the cache.

nvprof term       meaning
l1 request        memory instruction executed
l2 request        rd, wr, atom, atom_cas, red, ... operation of 1-4 sectors
l1 transaction    sub-command of an l1 memory instruction consisting of N threads that will flow through the l1 pipeline

The counters for accessing 1024 x 32 uint32_ts, where each warp accesses 32 consecutive uint32_ts, would be

1024        # of ld instructions executed = global requests
4096        # of l1 requests = l1 transactions
8192        # of l1 sector queries
1024        # of l2 requests
4096        # of l2 sectors read

In terms of responding to your code and PM counters:
The code posted is not what was run: the posted kernel has no observable side effect and would be reduced to an empty function, unless you compiled with -G, in which case you should fix that, as the amount of debug code generated will greatly impact the counters.
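
One way to rule this out is to give the kernel an observable side effect, for example by writing the loaded value back to global memory. A minimal sketch (B is an extra output buffer I am introducing here only for illustration; it is not in your code):

__global__ void readOffsetOut(const float* A, float* B, const int n, int offset)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int k = i + offset;
    if (k < n)
        B[i] = A[k];   // the global store keeps the global load alive
}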

If I try to reason from the code

readOffset<<<64, 512>>>

The memory access is aligned given offset == 0.

The values in the counter output do not match my expectations, as you would have executed only 1024 instructions. The fact that this shows 4096 leads me to believe the counters are not for the code snippet.

The gld_transactions_per_request value does not match my expectation of 8.

Please provide a fully reproducible example, and please annotate the grid launch dimensions and any other variables not fixed in the source code.

Thanks for your reply, @Greg.

  • No, the provided results actually correspond to the provided code (the input vector is of type float).
  • The code does have a side effect: I am storing the loaded global values in a shared variable (saveVal).
  • The code above has all the grid launch dimensions (64, 512) and was compiled using the following command:

nvcc -O3 --keep -gencode arch=compute_53,code=sm_53 L1_L2Cache.cu -o L1_L2Cache
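
For reference, an equivalent SASS dump can also be obtained directly from the built executable, e.g.:

cuobjdump -sass L1_L2Cache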

The generated SASS is as follows:

 Function _Z10readOffsetPfii:
  REG:5 STACK:0 SHARED:2048 LOCAL:0 CONSTANT[0]:336 TEXTURE:0 SURFACE:0 SAMPLER:0

        code for sm_53
                Function : _Z10readOffsetPfii
        .headerflags    @"EF_CUDA_SM53 EF_CUDA_PTX_SM(EF_CUDA_SM53)"
                                                                                 /* 0x001cfc00e22007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                        /* 0x4c98078000870001 */
        /*0010*/                   S2R R4, SR_TID.X ;                            /* 0xf0c8000002170004 */
        /*0018*/                   S2R R2, SR_CTAID.X ;                          /* 0xf0c8000002570002 */
                                                                                 /* 0x081fd842fea00ff1 */
        /*0028*/                   IADD R0, R4, c[0x0][0x14c] ;                  /* 0x4c10000005370400 */
        /*0030*/                   XMAD.MRG R3, R2.reuse, c[0x0] [0x8].H1, RZ ;  /* 0x4f107f8000270203 */
        /*0038*/                   XMAD R0, R2.reuse, c[0x0] [0x8], R0 ;         /* 0x4e00000000270200 */
                                                                                 /* 0x001fd800fc2007f6 */
        /*0048*/                   XMAD.PSL.CBCC R2, R2.H1, R3.H1, R0 ;          /* 0x5b30001800370202 */
        /*0050*/                   SHR.U32 R0, R2, 0x1e ;                        /* 0x3828000001e70200 */
        /*0058*/                   ISCADD R2.CC, R2, c[0x0][0x140], 0x2 ;        /* 0x4c18810005070202 */
                                                                                 /* 0x001fd000f62007f2 */
        /*0068*/                   IADD.X R3, R0, c[0x0][0x144] ;                /* 0x4c10080005170003 */
        /*0070*/                   LDG.E.CI R2, [R2] ;                           /* 0xeed4a00000070202 */
        /*0078*/                   SHL R0, R4, 0x2 ;                             /* 0x3848000000270400 */
                                                                                 /* 0x001ffc00ffe107f1 */
        /*0088*/                   STS [R0], R2 ;                                /* 0xef5c000000070002 */
        /*0090*/                   EXIT ;                                        /* 0xe30000000007000f */
        /*0098*/                   BRA 0x98 ;                                    /* 0xe2400fffff87000f */
                                                                                 /* 0x001f8000fc0007e0 */
        /*00a8*/                   NOP;                                          /* 0x50b0000000070f00 */
        /*00b0*/                   NOP;                                          /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                          /* 0x50b0000000070f00 */
                .............................