P100 unified cache behaviour and how to disable it

I am trying to understand better how the P100 unified cache works and if I can control for which loads the cache is used.

I did an easy experiment with a very simple kernel

__global__ void test(
        volatile float * in, volatile float *out
        )
{
        int x=threadIdx.x;
                out[x]=in[x];

}

It is executed as follows: (just 1 warp)

test<<<1,32>>>(in,out);

The reason I am using volatile is that according to Nvidia Documents the compiler is instructed that memory can change and forces not to use the L1 cache.

I compile the code using the -Xptxas --dlcm=cg to again disable L1 caching

nvcc -gencode arch=compute_60,code=sm_60 -Xptxas -dlcm=cg -Xptxas -dscm=wt experiment.cu

I go to profiling the using nvprof and the profiler output is as follows:

Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "Tesla P100-SXM2-16GB (0)"
    Kernel: test(float volatile *, float volatile *)
          1                             inst_per_warp                                 Instructions per warp   12.000000   12.000000   12.000000
          1                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
          1         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency     100.00%     100.00%     100.00%
          1                      inst_replay_overhead                           Instruction Replay Overhead    0.166667    0.166667    0.166667
          1      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1              gld_transactions_per_request                  Global Load Transactions Per Request    8.000000    8.000000    8.000000
          1              gst_transactions_per_request                 Global Store Transactions Per Request    4.000000    4.000000    4.000000
          1                 shared_store_transactions                             Shared Store Transactions           0           0           0
          1                  shared_load_transactions                              Shared Load Transactions           0           0           0
          1                   local_load_transactions                               Local Load Transactions           0           0           0
          1                  local_store_transactions                              Local Store Transactions           0           0           0
          1                          gld_transactions                              Global Load Transactions           8           8           8
          1                          gst_transactions                             Global Store Transactions           4           4           4
          1                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
          1                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
          1                      l2_read_transactions                                  L2 Read Transactions          36          36          36
          1                     l2_write_transactions                                 L2 Write Transactions          17          17          17
          1                    dram_read_transactions                       Device Memory Read Transactions           0           0           0
          1                   dram_write_transactions                      Device Memory Write Transactions           4           4           4
          1                           global_hit_rate                     Global Hit Rate in unified l1/tex      50.00%      50.00%      50.00%
          1                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
          1                  gld_requested_throughput                      Requested Global Load Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                  gst_requested_throughput                     Requested Global Store Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                            gld_throughput                                Global Load Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                            gst_throughput                               Global Store Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.00%       0.00%
          1                        tex_cache_hit_rate                                Unified Cache Hit Rate      50.00%      50.00%      50.00%
          1                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)     100.00%     100.00%     100.00%
          1                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)     100.00%     100.00%     100.00%
          1                      dram_read_throughput                         Device Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                     dram_write_throughput                        Device Memory Write Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                      tex_cache_throughput                              Unified Cache Throughput  45.481MB/s  45.481MB/s  45.480MB/s
          1                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  45.481MB/s  45.481MB/s  45.480MB/s
          1                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  45.481MB/s  45.481MB/s  45.480MB/s
          1                        l2_read_throughput                                 L2 Throughput (Reads)  409.33MB/s  409.33MB/s  409.33MB/s
          1                       l2_write_throughput                                L2 Throughput (Writes)  193.29MB/s  193.29MB/s  193.29MB/s
          1                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   sysmem_write_throughput                        System Memory Write Throughput  56.851MB/s  56.851MB/s  56.851MB/s
          1                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
          1                    tex_cache_transactions                            Unified Cache Transactions           4           4           4
          1                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
          1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
          1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
          1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
          1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
          1                             inst_executed                                 Instructions Executed          12          12          12
          1                               inst_issued                                   Instructions Issued          14          14          14
          1                          dram_utilization                             Device Memory Utilization     Low (1)     Low (1)     Low (1)
          1                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
          1                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       4.60%       4.60%       4.60%
          1                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       6.26%       6.26%       6.26%
          1                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      28.86%      28.86%      28.86%
          1                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
          1                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
          1                               stall_other                           Issue Stall Reasons (Other)       3.70%       3.70%       3.70%
          1          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)      54.92%      54.92%      54.92%
          1                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.13%       0.13%       0.13%
          1                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
          1                                inst_fp_32                               FP Instructions(Single)           0           0           0
          1                                inst_fp_64                               FP Instructions(Double)           0           0           0
          1                              inst_integer                                  Integer Instructions         192         192         192
          1                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
          1                              inst_control                             Control-Flow Instructions          32          32          32
          1                        inst_compute_ld_st                               Load/Store Instructions          64          64          64
          1                                 inst_misc                                     Misc Instructions          96          96          96
          1           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
          1                               issue_slots                                           Issue Slots          14          14          14
          1                                 cf_issued                      Issued Control-Flow Instructions           1           1           1
          1                               cf_executed                    Executed Control-Flow Instructions           1           1           1
          1                               ldst_issued                        Issued Load/Store Instructions           9           9           9
          1                             ldst_executed                      Executed Load/Store Instructions           3           3           3
          1                       atomic_transactions                                   Atomic Transactions           0           0           0
          1           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
          1                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
          1                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)           4           4           4
          1                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       1.53%       1.53%       1.53%
          1                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.00%       0.00%       0.00%
          1                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)           4           4           4
          1             nvlink_total_data_transmitted                         NVLink Total Data Transmitted           0           0           0
          1                nvlink_total_data_received                            NVLink Total Data Received           0           0           0
          1              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
          1                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
          1          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       0.00%       0.00%       0.00%
          1             nvlink_overhead_data_received                         NVLink Overhead Data Received       0.00%       0.00%       0.00%
          1      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
          1       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
          1       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
          1        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
          1       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
          1        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
          1                nvlink_transmit_throughput                            NVLink Transmit Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                 nvlink_receive_throughput                             NVLink Receive Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1       nvlink_total_response_data_received                   NVLink Total Response Data Received           0           0           0
          1        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
          1                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
          1                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
          1                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
          1                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
          1                                inst_fp_16                                 HP Instructions(Half)           0           0           0
          1                                       ipc                                          Executed IPC    0.011776    0.011776    0.011776
          1                                issued_ipc                                            Issued IPC    0.013739    0.013739    0.013739
          1                    issue_slot_utilization                                Issue Slot Utilization       0.69%       0.69%       0.69%
          1                             sm_efficiency                               Multiprocessor Activity       0.30%       0.30%       0.30%
          1                        achieved_occupancy                                    Achieved Occupancy    0.015625    0.015625    0.015625
          1                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.017413    0.017413    0.017413
          1                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
          1                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
          1                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
          1                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                        tex_fu_utilization                     Texture Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                    special_fu_utilization                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
          1                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
          1                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
          1                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)

I have a lot of questions here:

The unified cache is clearly being used. Is there a way to disable it, for some loads? I want to save the unified cache memory for some other loads). If no, has the function __ldg() any use on the P100?

Why are there 8 gld_transactions while there are 4 tex_cache_transactions and 4 l2_tex_read_transactions?

What is the difference between global_hit_rate and tex_cache_hit_rate? They are both the same unified cache, and why are they set to 50% and not 0%?

Can you point to the specific spot in the documentation where it says this?

The semantics of the ‘volatile’ attribute (modifier, qualifier depending on language spec) are: “this object can be modified by an agent outside the scope of this code”. As a consequence, a compiler is usually forced to emit an actual memory access instruction for every access to this object in the code rather than loading it into a register once and using the register-based copy, as data coherency between register and memory cannot be guaranteed.

Since traditional cache hierarchies (L1, L2, L3) do provide coherency, the use of volatile has no bearing on cacheability. Note that special caches like GPU’s texture caches do not typically provide coherency for reasons of simplicity and performance.

Section F.3.3.3 of the CUDA programming guide

But anyway with or without volatile I still have the same results, that is I am going through the Unified Cache, which I would like to “switch off”

I have a hard time believing the “L1” part of the cited description is accurate. It looks like an error to me. The “register” part definitely applies, as per my previous explanation. I am not sure the cache hierarchy can be bypassed on Pascal. I seem to recall it cannot be bypassed. You could try the PTXAS flag provided for cache control:

To bypass the cache, try “cg”. The way you pass options to the PTXAS backend-compiler is by prefixing with -Xptxas on the nvcc command line: -Xptxas -dlcm=cg

That is what I did, and seems futile

I really need some doc that states it loud and clear that unified cache can’t be disabled.

Why is that? Do you have a bet going? You might want to take a look at the following paper which mentions disabling the cache and GTX 1080 (I have not read this paper):

Shi Dong, et. al.: “Characterizing the Microarchitectural Implications of a Convolutional Neural Network (CNN) Execution on GPUs”, ICPE’18, April 9–13, 2018, Berlin, Germany

Have you tried inline assembler instead of the PTXAS command line switch?

asm ("ld.global.cg.f32 %0,[%1];" : "=f"(X) : "l"(&Y));

If the tool chain is working properly, there should be no difference. But it’s worth a try if one is desperate :-)

Ok I had a look at the sass and realized that using volatile is not a good idea.
The dump using volatile is as follows

code for sm_60
                Function : _Z4testPVfS0_
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                   /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2;          /* 0x3848000000270004 */
                                                                   /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR R0, R0, 0x1e;               /* 0x3829000001e70000 */
        /*0030*/                   IADD R2.CC, R4, c[0x0][0x140];  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R0, c[0x0][0x144];   /* 0x4c10080005170003 */
                                                                   /* 0x001f8800fcc00771 */
        /*0048*/                   LDG.E.CV R2, [R2];              /* 0xeed4e00000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x148];  /* 0x4c10800005270404 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c];   /* 0x4c10080005370005 */
                                                                   /* 0x001ffc00fc8040f1 */
        /*0068*/                   STG.E.WT [R4], R2;              /* 0xeedce00000070402 */
        /*0070*/                   NOP;                            /* 0x50b0000000070f00 */
        /*0078*/                   EXIT;                           /* 0xe30000000007000f */
                                                                   /* 0x001f8000fc0007ff */
        /*0088*/                   BRA 0x80;                       /* 0xe2400fffff07000f */
        /*0090*/                   NOP;                            /* 0x50b0000000070f00 */
        /*0098*/                   NOP;                            /* 0x50b0000000070f00 */
                                                                   /* 0x001f8000fc0007e0 */
        /*00a8*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                            /* 0x50b0000000070f00 */
                ........................

LDG.E.CV means that

I removed the volatile and now that sass dump looks like this:

code for sm_60
                Function : _Z4testPVfS0_
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                   /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2;          /* 0x3848000000270004 */
                                                                   /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR R0, R0, 0x1e;               /* 0x3829000001e70000 */
        /*0030*/                   IADD R2.CC, R4, c[0x0][0x140];  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R0, c[0x0][0x144];   /* 0x4c10080005170003 */
                                                                   /* 0x001f8800fcc00771 */
        /*0048*/                   LDG.E.CV R2, [R2];              /* 0xeed4e00000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x148];  /* 0x4c10800005270404 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c];   /* 0x4c10080005370005 */
                                                                   /* 0x001ffc00fc8040f1 */
        /*0068*/                   STG.E.WT [R4], R2;              /* 0xeedce00000070402 */
        /*0070*/                   NOP;                            /* 0x50b0000000070f00 */
        /*0078*/                   EXIT;                           /* 0xe30000000007000f */
                                                                   /* 0x001f8000fc0007ff */
        /*0088*/                   BRA 0x80;                       /* 0xe2400fffff07000f */
        /*0090*/                   NOP;                            /* 0x50b0000000070f00 */
        /*0098*/                   NOP;                            /* 0x50b0000000070f00 */
                                                                   /* 0x001f8000fc0007e0 */
        /*00a8*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b0*/                   NOP;                            /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                            /* 0x50b0000000070f00 */
                ........................

We know have an LDG.E.CG, which is equivalent to njuffa’s idea to try.

On profiling, the new compilation statistics are still the same!!! Unified cache is being used!!

All local and global memory accesses are through the L1 cache. It is possible to control the cache policy using the compiler operation -dlcm. Even if cache operator is set to not cache in L1 the L1 cache will use temporary cache storage to coalesce all threads in the warp instruction.

Thanks a lot for this answer. I’ve been experiment a bit to understand better how this works.

I have for questions whose answers I can’t figure out through my experimentation

  1. What is the difference between global_hit_rate and tex_cache_hit_rate and how are they calculated?
  2. Is there a way to calculate hit rate on the unified cache for cacheable content only?
  3. Do atomicAdd involve the L1 cache somehow. Profiling shows no use, but asking just in case
  4. atomicAdds use L2. Are loads on L2 generated by atomicAdds cachable and is there some way to calculate Hit rate on L2 for atomicAdds?

Thanks for any future feedback given