[Jetson-TK1] How to measure DRAM <-> L2 R/W bandwidth on Tegra K1?

[Edit] See third post below

I am trying to force global loads through the L2 cache on the Tegra K1 (Kepler). I have designed a very simple kernel where each thread block loads the same consecutive array of 8-bit unsigned integers using the following PTX instruction:

ld.global.cg.u8 %lda, [%src]

The .cg in the instruction should force the GPU to cache the global memory access in the L2 cache (as per this documentation: http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld).
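
For reference, in isolation the load I am generating boils down to something like the following helper (just a sketch; the name ld_cg_u8 is made up, and the "r" pointer constraint assumes the TK1's 32-bit device pointers):

#include <stdint.h>

/* Load one byte through L2 only (.cg), zero-extended into a 32-bit register */
__device__ __forceinline__ unsigned int ld_cg_u8(const uint8_t *p)
{
    unsigned int v;
    asm("ld.global.cg.u8 %0, [%1];" : "=r"(v) : "r"(p));
    return v;
}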

To prevent the compiler from optimising away my global load (something I struggled with for a long time), I put this at the end of the kernel:

if(!src){
    src[index] = xxxx;
}
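
What I ended up relying on is the dummy store above. As far as I understand the inline PTX documentation, marking the asm statement volatile should also stop nvcc from deleting the load even when its result looks unused; something along these lines (src being the same 32-bit device pointer as above):

unsigned int v;
asm volatile ("ld.global.cg.u8 %0, [%1];" : "=r"(v) : "r"(src));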

When I run this in nvprof, I get the same global load throughput as L2 read throughput:

gld_throughput           Global Load Throughput  2.1534GB/s  2.1534GB/s  2.1534GB/s
l2_read_throughput       L2 Throughput (Reads)   2.2487GB/s  2.2487GB/s  2.2487GB/s

…which suggests that nothing actually gets cached in L2, even though 1M/32 thread blocks all access the same consecutive array of 32 8-bit unsigned integers.

What am I missing here? Why is my data not getting cached? And is there any way to force nvcc to keep my code, even though it thinks it’s “stupid”, pointless and unoptimised? :P

/tearing my hair out after > 16 hours of PTX assembly coding

I am also adding an example kernel and the nvprof output below. The kernel is meant to resemble a very simple operation that adds two numbers, where

uint8_t *matrix

should be cached, since all thread blocks access this structure!

/* Tests l2 cached memory loads */
__global__ void memcpy_l2_cached_read(uint8_t *buf, uint8_t *matrix){

    uint8_t var;
    uint16_t var2;
    int ind;
    ind = threadIdx.x;

    uint8_t *src, *lsrc;
    src = &matrix[ind];

    ind = threadIdx.x + blockDim.x * blockIdx.x;
    lsrc = &buf[ind];

    /* PTX register declarations */
    asm(
            ".reg .s32 %src;\n\t"
            ".reg .s32 %lsrc;\n\t"
            ".reg .u8 %lda;\n\t"
            ".reg .u8 %ldb;\n\t"
            ".reg .u16 %ldc;\n\t"
            ".reg .u16 %ldd;\n\t"
            ".reg .u16 %ldr;\n\t"
            ".reg .u8 %lds;\n\t"
       );

    /* Move the pointer arguments into PTX registers and convert them to global addresses */
    asm(
            "mov.u32 %src, %0;\n\t"
            "cvta.to.global.u32 %src, %src;\n\t"
            "mov.u32 %lsrc, %1;\n\t"
            "cvta.to.global.u32 %lsrc, %lsrc;\n\t"

            :: "r"(src), "r"(lsrc)
       );

    /* The actual loads: matrix via L2 only (.cg), buf via the default path, then add the two bytes */
    asm(
            "ld.global.cg.u8 %lda, [%src];\n\t"
            "ld.global.u8 %ldb, [%lsrc];\n\t"
            "cvt.u16.u8 %ldc, %lda;\n\t"
            "cvt.u16.u8 %ldd, %ldb;\n\t"
            "add.u16 %ldr, %ldd, %ldc;\n\t"
            "mov.u16 %0, %ldr;\n\t"
            : "=h"(var2)
       );

    var = (uint8_t) var2;

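    /* Dead-store guard: never taken for a valid buf pointer, but it
       stops nvcc from optimising the loads above away. */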
    if(!buf){
        buf[ind] = var * *src;
    }

}

nvprof output:

==1672== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "GK20A (0)"
        Kernel: memcpy_l2_cached_read(unsigned char*, unsigned char*)
          1        l2_subp0_total_read_sector_queries      978969      978969      978969
          1       l2_subp0_total_write_sector_queries      312521      312521      312521
          1                         elapsed_cycles_sm    11829920    11829920    11829920
          1          uncached_global_load_transaction      937500      937500      937500

==1672== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GK20A (0)"
        Kernel: memcpy_l2_cached_read(unsigned char*, unsigned char*)
          1                  l1_cache_global_hit_rate                        L1 Global Hit Rate       0.00%       0.00%       0.00%
          1                   l1_cache_local_hit_rate                         L1 Local Hit Rate       0.00%       0.00%       0.00%
          1                                       ipc                              Executed IPC    0.368545    0.368545    0.368545
          1                  gld_requested_throughput          Requested Global Load Throughput  2.1534GB/s  2.1534GB/s  2.1534GB/s
          1                            gst_throughput                   Global Store Throughput  717.81MB/s  717.81MB/s  717.81MB/s
          1                            gld_throughput                    Global Load Throughput  2.1534GB/s  2.1534GB/s  2.1534GB/s
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                                issued_ipc                                Issued IPC    0.477173    0.477173    0.477173
          1                          gld_transactions                  Global Load Transactions      937500      937500      937500
          1                          gst_transactions                 Global Store Transactions      312500      312500      312500
          1                     local_load_throughput              Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput             Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput             Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   shared_store_throughput            Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                             inst_executed                     Instructions Executed     4375000     4375000     4375000
          1                     nc_l2_read_throughput        L2 Throughput (Non-Coherent Reads)  0.00000B/s  0.00000B/s  0.00000B/s
          1                   nc_l2_read_transactions         L2 Non-Coherent Read Transactions           0           0           0
          1                                inst_fp_32                   FP Instructions(Single)           0           0           0
          1                                inst_fp_64                   FP Instructions(Double)           0           0           0
          1                              inst_integer                      Integer Instructions    50000000    50000000    50000000
          1                          inst_bit_convert                  Bit-Convert Instructions    10000000    10000000    10000000
          1                              inst_control                 Control-Flow Instructions    10000000    10000000    10000000
          1                        inst_compute_ld_st                   Load/Store Instructions    40000000    40000000    40000000
          1                      l2_read_transactions                      L2 Read Transactions      978969      978969      978969
          1                     l2_write_transactions                     L2 Write Transactions      312521      312521      312521
          1                        l2_read_throughput                     L2 Throughput (Reads)  2.2487GB/s  2.2487GB/s  2.2487GB/s
          1                       l2_write_throughput                    L2 Throughput (Writes)  717.85MB/s  717.85MB/s  717.85MB/s
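
A quick sanity check on these numbers (assuming 32-byte L2 sectors and the TK1's maximum 852 MHz GPU clock, so take it as approximate):

978969 read sectors * 32 B        ~ 31.3 MB read from L2
11829920 SM cycles / 852 MHz      ~ 13.9 ms
31.3 MB / 13.9 ms                 ~ 2.25 GB/s

…which is essentially the reported l2_read_throughput, so at least the counters are consistent with each other.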

I found out that the metrics listed above (gld_throughput and l2_read_throughput) do not actually represent the DRAM <-> L2 bandwidth. According to this post: http://stackoverflow.com/questions/11926222/how-to-profile-the-number-of-global-memory-transactions-without-using-uncached-g I need the event counter

fb_read_transactions

for this purpose. However, the Tegra K1 does not expose this event counter. Is there any other way to measure the DRAM R/W bandwidth (other than using event counters)?
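
In the meantime, the only alternative I can come up with is to measure effective bandwidth by hand: time a streaming kernel with CUDA events and divide the bytes moved by the elapsed time. A rough sketch (buffer size and launch configuration are just placeholders, and this gives achieved application bandwidth rather than the DRAM traffic an fb_read_transactions counter would report):

#include <stdint.h>
#include <stdio.h>

__global__ void stream_copy(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 /* one byte read + one byte written */
}

int main(void)
{
    const size_t n = 64u << 20;         /* 64 MiB, large enough not to fit in L2 */
    uint8_t *in, *out;
    cudaMalloc((void **)&in, n);
    cudaMalloc((void **)&out, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    stream_copy<<<(unsigned int)((n + 255) / 256), 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("effective bandwidth: %.2f GB/s\n", (2.0 * n / 1e9) / (ms / 1e3));

    cudaFree(in);
    cudaFree(out);
    return 0;
}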

I found a temporary solution to this problem:

https://devtalk.nvidia.com/default/topic/864546/cuda-programming-and-performance/measuring-dram-throughput/?offset=12#4634724