I am trying to understand better how the P100 unified cache works and if I can control for which loads the cache is used.
I did an easy experiment with a very simple kernel
__global__ void test(
volatile float * in, volatile float *out
)
{
int x=threadIdx.x;
out[x]=in[x];
}
It is executed as follows: (just 1 warp)
test<<<1,32>>>(in,out);
The reason I am using volatile is that according to Nvidia Documents the compiler is instructed that memory can change and forces not to use the L1 cache.
I compile the code using the -Xptxas --dlcm=cg to again disable L1 caching
nvcc -gencode arch=compute_60,code=sm_60 -Xptxas -dlcm=cg -Xptxas -dscm=wt experiment.cu
I go to profiling the using nvprof and the profiler output is as follows:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla P100-SXM2-16GB (0)"
Kernel: test(float volatile *, float volatile *)
1 inst_per_warp Instructions per warp 12.000000 12.000000 12.000000
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 100.00% 100.00% 100.00%
1 inst_replay_overhead Instruction Replay Overhead 0.166667 0.166667 0.166667
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 8.000000 8.000000 8.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 8 8 8
1 gst_transactions Global Store Transactions 4 4 4
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 36 36 36
1 l2_write_transactions L2 Write Transactions 17 17 17
1 dram_read_transactions Device Memory Read Transactions 0 0 0
1 dram_write_transactions Device Memory Write Transactions 4 4 4
1 global_hit_rate Global Hit Rate in unified l1/tex 50.00% 50.00% 50.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 gst_requested_throughput Requested Global Store Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 gld_throughput Global Load Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 gst_throughput Global Store Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 50.00% 50.00% 50.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 100.00% 100.00% 100.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 100.00% 100.00% 100.00%
1 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 dram_write_throughput Device Memory Write Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 tex_cache_throughput Unified Cache Throughput 45.481MB/s 45.481MB/s 45.480MB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 45.481MB/s 45.481MB/s 45.480MB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 45.481MB/s 45.481MB/s 45.480MB/s
1 l2_read_throughput L2 Throughput (Reads) 409.33MB/s 409.33MB/s 409.33MB/s
1 l2_write_throughput L2 Throughput (Writes) 193.29MB/s 193.29MB/s 193.29MB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 56.851MB/s 56.851MB/s 56.851MB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified Cache Transactions 4 4 4
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 12 12 12
1 inst_issued Instructions Issued 14 14 14
1 dram_utilization Device Memory Utilization Low (1) Low (1) Low (1)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 4.60% 4.60% 4.60%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 6.26% 6.26% 6.26%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 28.86% 28.86% 28.86%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 3.70% 3.70% 3.70%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 54.92% 54.92% 54.92%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.13% 0.13% 0.13%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 192 192 192
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 32 32 32
1 inst_compute_ld_st Load/Store Instructions 64 64 64
1 inst_misc Misc Instructions 96 96 96
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 14 14 14
1 cf_issued Issued Control-Flow Instructions 1 1 1
1 cf_executed Executed Control-Flow Instructions 1 1 1
1 ldst_issued Issued Load/Store Instructions 9 9 9
1 ldst_executed Executed Load/Store Instructions 3 3 3
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 4 4 4
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.53% 1.53% 1.53%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.00% 0.00% 0.00%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 4 4 4
1 nvlink_total_data_transmitted NVLink Total Data Transmitted 0 0 0
1 nvlink_total_data_received NVLink Total Data Received 0 0 0
1 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
1 nvlink_user_data_received NVLink User Data Received 0 0 0
1 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 0.00% 0.00% 0.00%
1 nvlink_overhead_data_received NVLink Overhead Data Received 0.00% 0.00% 0.00%
1 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
1 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
1 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
1 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
1 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
1 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
1 nvlink_transmit_throughput NVLink Transmit Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nvlink_receive_throughput NVLink Receive Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 nvlink_total_response_data_received NVLink Total Response Data Received 0 0 0
1 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 ipc Executed IPC 0.011776 0.011776 0.011776
1 issued_ipc Issued IPC 0.013739 0.013739 0.013739
1 issue_slot_utilization Issue Slot Utilization 0.69% 0.69% 0.69%
1 sm_efficiency Multiprocessor Activity 0.30% 0.30% 0.30%
1 achieved_occupancy Achieved Occupancy 0.015625 0.015625 0.015625
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.017413 0.017413 0.017413
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 tex_fu_utilization Texture Function Unit Utilization Low (1) Low (1) Low (1)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
I have a lot of questions here:
The unified cache is clearly being used. Is there a way to disable it, for some loads? I want to save the unified cache memory for some other loads). If no, has the function __ldg() any use on the P100?
Why are there 8 gld_transactions while there are 4 tex_cache_transactions and 4 l2_tex_read_transactions?
What is the difference between global_hit_rate and tex_cache_hit_rate? They are both the same unified cache, and why are they set to 50% and not 0%?