Nothing which can be used for this purpose:
xxxxxx@tegra-ubuntu:~$ nvprof --query-metrics --query-events
Available Events:
Name Description
Device 0 (GK20A):
Domain domain_a:
tex0_cache_sector_queries: Number of texture cache 0 requests. This increments by 1 for each 32-byte access.
tex1_cache_sector_queries: Number of texture cache 1 requests. This increments by 1 for each 32-byte access.
tex0_cache_sector_misses: Number of texture cache 0 misses. This increments by 1 for each 32-byte access.
tex1_cache_sector_misses: Number of texture cache 1 misses. This increments by 1 for each 32-byte access.
rocache_subp0_gld_warp_count_128b: Number of 128-bit global load requests via slice 0 of read-only data cache. Increments per warp.
rocache_subp0_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. Increments per warp.
rocache_subp0_gld_warp_count_64b: Number of 64-bit global load requests via slice 0 of read-only data cache. Increments per warp.
rocache_subp1_gld_warp_count_128b: Number of 128-bit global load requests via slice 1 of read-only data cache. Increments per warp.
rocache_subp1_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. Increments per warp.
rocache_subp1_gld_warp_count_64b: Number of 64-bit global load requests via slice 1 of read-only data cache. Increments per warp.
rocache_subp0_gld_thread_count_128b: Number of 128-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
rocache_subp0_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
rocache_subp0_gld_thread_count_64b: Number of 64-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
rocache_subp1_gld_thread_count_128b: Number of 128-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
rocache_subp1_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
rocache_subp1_gld_thread_count_64b: Number of 64-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.
elapsed_cycles_sm: Elapsed clocks
Domain domain_b:
l2_subp0_read_tex_sector_queries: Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sysmem_sector_queries: Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_sysmem_sector_queries: Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
Domain domain_c:
gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.
rocache_gld_inst_8bit: Total number of 8-bit global load via read-only data cache that are executed by all the threads across all thread blocks.
rocache_gld_inst_16bit: Total number of 16-bit global load via read-only data cache that are executed by all the threads across all thread blocks.
rocache_gld_inst_32bit: Total number of 32-bit global load via read-only data cache that are executed by all the threads across all thread blocks.
rocache_gld_inst_64bit: Total number of 64-bit global load via read-only data cache that are executed by all the threads across all thread blocks.
rocache_gld_inst_128bit: Total number of 128-bit global load via read-only data cache that are executed by all the threads across all thread blocks.
Domain domain_d:
prof_trigger_00: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_01: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_02: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_03: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_04: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_05: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_06: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_07: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
warps_launched: Number of warps launched on a multiprocessor.
threads_launched: Number of threads launched on a multiprocessor.
inst_issued1: Number of single instruction issued per cycle
inst_issued2: Number of dual instructions issued per cycle
inst_executed: Number of instructions executed, do not include replays.
thread_inst_executed: Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction.
not_predicated_off_thread_inst_executed: Number of not predicated off instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads that execute this instruction.
atom_count: Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.
atom_cas_count: Number of warps executing atomic compare and swap operations. Increments by one if at least one thread in a warp executes the instruction.
gred_count: Number of warps executing reduction operations on global memory. Increments by one if at least one thread in a warp executes the instruction
shared_load: Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.
shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
local_load: Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.
local_store: Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.
gld_request: Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.
gst_request: Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.
active_cycles: Number of cycles a multiprocessor has at least one active warp.
active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.
sm_cta_launched: Number of thread blocks launched on a multiprocessor.
local_load_transactions: Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
local_store_transactions: Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_shared_load_transactions: Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_shared_store_transactions: Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
__l1_global_load_transactions: Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
__l1_global_store_transactions: Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_local_load_hit: Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_load_miss: Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_store_hit: Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_store_miss: Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively.
l1_global_load_hit: Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_global_load_miss: Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
uncached_global_load_transaction: Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
global_store_transaction: Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
global_ld_mem_divergence_replays: Number of instruction replays for global memory loads. Instruction is replayed if the instruction is accessing more than one cache line of 128 bytes. For each extra cache line access the counter is incremented by 1.
global_st_mem_divergence_replays: Number of instruction replays for global memory stores. Instruction is replayed if the instruction is accessing more than one cache line of 128 bytes. For each extra cache line access the counter is incremented by 1.
shared_load_replay: Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes).
shared_store_replay: Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle.
Available Metrics:
Name Description
Device 0 (GK20A):
l1_cache_global_hit_rate: Hit rate in L1 cache for global loads
l1_cache_local_hit_rate: Hit rate in L1 cache for local loads and stores
sm_efficiency: The percentage of time at least one warp is active on a multiprocessor averaged over all multiprocessors on the GPU
ipc: Instructions executed per cycle
achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
gld_requested_throughput: Requested global memory load throughput
gst_requested_throughput: Requested global memory store throughput
sm_efficiency_instance: The percentage of time at least one warp is active on a multiprocessor
ipc_instance: Instructions executed per cycle for a single multiprocessor
inst_replay_overhead: Average number of replays for each instruction executed
shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
global_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
tex_cache_hit_rate: Texture cache hit rate
tex_cache_throughput: Texture cache throughput
gst_throughput: Global memory store throughput
gld_throughput: Global memory load throughput
local_replay_overhead: Average number of replays due to local memory accesses for each instruction executed
shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput. Values greater than 100% indicate that, on average, the load requests of multiple threads in a warp fetched from the same memory address
gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput. Values greater than 100% indicate that, on average, the store requests of multiple threads in a warp targeted the same memory address
warp_execution_efficiency: Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
nc_gld_requested_throughput: Requested throughput for global memory loaded via non-coherent cache
issued_ipc: Instructions issued per cycle
inst_per_warp: Average number of instructions executed by each warp
issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
shared_load_transactions: Number of shared memory load transactions
shared_store_transactions: Number of shared memory store transactions
gld_transactions: Number of global memory load transactions
gst_transactions: Number of global memory store transactions
tex_cache_transactions: Texture cache read transactions
local_load_throughput: Local memory load throughput
local_store_throughput: Local memory store throughput
shared_load_throughput: Shared memory load throughput
shared_store_throughput: Shared memory store throughput
warp_nonpred_execution_efficiency: Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor
cf_issued: Number of issued control-flow instructions
cf_executed: Number of executed control-flow instructions
ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flop_count_dp: Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)
flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
stall_other: Percentage of stalls occurring due to miscellaneous reasons
tex_utilization: The utilization level of the texture cache relative to the peak utilization
ldst_fu_utilization: The utilization level of the multiprocessor function units that execute global, local and shared memory instructions
alu_fu_utilization: The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions
cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions
tex_fu_utilization: The utilization level of the multiprocessor function units that execute texture instructions
inst_executed: The number of instructions executed
inst_issued: The number of instructions issued
issue_slots: The number of issue slots used
nc_l2_read_throughput: Memory read throughput for non-coherent global read requests seen at L2 cache
nc_l2_read_transactions: Memory read transactions seen at L2 cache for non-coherent global read requests
nc_cache_global_hit_rate: Hit rate in non-coherent cache for global loads
nc_gld_throughput: Non-coherent global memory load throughput
nc_gld_efficiency: Ratio of requested non-coherent global memory load throughput to required non-coherent global memory load throughput
inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)
inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)
inst_integer: Number of integer instructions executed by non-predicated threads
inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
inst_misc: Number of miscellaneous instructions executed by non-predicated threads
inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
atomic_replay_overhead: Average number of replays due to atomic and reduction bank conflicts for each instruction executed
atomic_transactions: Global memory atomic and reduction transactions
atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
l2_texture_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests
l2_utilization: The utilization level of the L2 cache relative to the peak utilization
dram_utilization: The utilization level of the device memory relative to the peak utilization
l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache
l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests
flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
stall_constant_memory_dependency: Percentage of stalls occurring because of immediate constant cache miss
stall_memory_throttle: Percentage of stalls occurring because of memory throttle
stall_not_selected: Percentage of stalls occurring because warp was not selected
eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
atomic_throughput: Global memory atomic and reduction throughput