How to check the occupancy rate of GPU memory?

How can I check the occupancy rate of GPU memory and the utilization of the SPs (streaming processors) on the Jetson TK1?

nvprof can measure both for you. Run nvprof --query-metrics to list everything it can report on your device. On the TK1 the list includes occupancy and streaming multiprocessor utilization (see achieved_occupancy and sm_efficiency in the output below).

ubuntu@tegra-ubuntu:/usr/local/cuda/bin$ ./nvprof --query-metrics
Available Metrics:
                            Name   Description
Device 0 (GK20A):
        l1_cache_global_hit_rate:  Hit rate in L1 cache for global loads

         l1_cache_local_hit_rate:  Hit rate in L1 cache for local loads and stores

                   sm_efficiency:  The percentage of time at least one warp is active on a multiprocessor averaged over all multiprocessors on the GPU

                             ipc:  Instructions executed per cycle

              achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

        gld_requested_throughput:  Requested global memory load throughput

        gst_requested_throughput:  Requested global memory store throughput

          sm_efficiency_instance:  The percentage of time at least one warp is active on a multiprocessor

                    ipc_instance:  Instructions executed per cycle for a single multiprocessor

            inst_replay_overhead:  Average number of replays for each instruction executed

          shared_replay_overhead:  Average number of replays due to shared memory conflicts for each instruction executed

          global_replay_overhead:  Average number of replays due to global memory cache misses for each instruction executed

    global_cache_replay_overhead:  Average number of replays due to global memory cache misses for each instruction executed

              tex_cache_hit_rate:  Texture cache hit rate

            tex_cache_throughput:  Texture cache throughput

                  gst_throughput:  Global memory store throughput

                  gld_throughput:  Global memory load throughput

           local_replay_overhead:  Average number of replays due to local memory accesses for each instruction executed

               shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput

                  gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput. Values greater than 100% indicate that, on average, the load requests of multiple threads in a warp fetched from the same memory address

                  gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput. Values greater than 100% indicate that, on average, the store requests of multiple threads in a warp targeted the same memory address

       warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

     nc_gld_requested_throughput:  Requested throughput for global memory loaded via non-coherent cache

                      issued_ipc:  Instructions issued per cycle

                   inst_per_warp:  Average number of instructions executed by each warp

          issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles

local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load

local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store

shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load

shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store

    gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load

    gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store

         local_load_transactions:  Number of local memory load transactions

        local_store_transactions:  Number of local memory store transactions

        shared_load_transactions:  Number of shared memory load transactions

       shared_store_transactions:  Number of shared memory store transactions

                gld_transactions:  Number of global memory load transactions

                gst_transactions:  Number of global memory store transactions

          tex_cache_transactions:  Texture cache read transactions

           local_load_throughput:  Local memory load throughput

          local_store_throughput:  Local memory store throughput

          shared_load_throughput:  Shared memory load throughput

         shared_store_throughput:  Shared memory store throughput

warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor

                       cf_issued:  Number of issued control-flow instructions

                     cf_executed:  Number of executed control-flow instructions

                     ldst_issued:  Number of issued local, global, shared and texture memory load and store instructions

                   ldst_executed:  Number of executed local, global, shared and texture memory load and store instructions

                   flop_count_sp:  Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)

               flop_count_sp_add:  Number of single-precision floating-point add operations executed by non-predicated threads

               flop_count_sp_mul:  Number of single-precision floating-point multiply operations executed by non-predicated threads

               flop_count_sp_fma:  Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads

                   flop_count_dp:  Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)

               flop_count_dp_add:  Number of double-precision floating-point add operations executed by non-predicated threads

               flop_count_dp_mul:  Number of double-precision floating-point multiply operations executed by non-predicated threads

               flop_count_dp_fma:  Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads

           flop_count_sp_special:  Number of single-precision floating-point special operations executed by non-predicated threads

                stall_inst_fetch:  Percentage of stalls occurring because the next assembly instruction has not yet been fetched

           stall_exec_dependency:  Percentage of stalls occurring because an input required by the instruction is not yet available

         stall_memory_dependency:  Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding

                   stall_texture:  Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests

                      stall_sync:  Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

                     stall_other:  Percentage of stalls occurring due to miscellaneous reasons

                 tex_utilization:  The utilization level of the texture cache relative to the peak utilization

             ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute global, local and shared memory instructions

              alu_fu_utilization:  The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions

               cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions

              tex_fu_utilization:  The utilization level of the multiprocessor function units that execute texture instructions

                   inst_executed:  The number of instructions executed

                     inst_issued:  The number of instructions issued

                     issue_slots:  The number of issue slots used

           nc_l2_read_throughput:  Memory read throughput for non-coherent global read requests seen at L2 cache

         nc_l2_read_transactions:  Memory read transactions seen at L2 cache for non-coherent global read requests

        nc_cache_global_hit_rate:  Hit rate in non-coherent cache for global loads

               nc_gld_throughput:  Non-coherent global memory load throughput

               nc_gld_efficiency:  Ratio of requested non-coherent global memory load throughput to required non-coherent global memory load throughput

                      inst_fp_32:  Number of single-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)

                      inst_fp_64:  Number of double-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)

                    inst_integer:  Number of integer instructions executed by non-predicated threads

                inst_bit_convert:  Number of bit-conversion instructions executed by non-predicated threads

                    inst_control:  Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)

              inst_compute_ld_st:  Number of compute load/store instructions executed by non-predicated threads

                       inst_misc:  Number of miscellaneous instructions executed by non-predicated threads

 inst_inter_thread_communication:  Number of inter-thread communication instructions executed by non-predicated threads

          atomic_replay_overhead:  Average number of replays due to atomic and reduction bank conflicts for each instruction executed

             atomic_transactions:  Global memory atomic and reduction transactions

 atomic_transactions_per_request:  Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction

            l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests

           l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests

      l2_texture_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache

              l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests

             l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests

            l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests

                  l2_utilization:  The utilization level of the L2 cache relative to the peak utilization

                dram_utilization:  The utilization level of the device memory relative to the peak utilization

        l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache

          l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests

              flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations

              flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations

                 stall_pipe_busy:  Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

stall_constant_memory_dependency:  Percentage of stalls occurring because of immediate constant cache miss

           stall_memory_throttle:  Percentage of stalls occurring because of memory throttle

              stall_not_selected:  Percentage of stalls occurring because warp was not selected

        eligible_warps_per_cycle:  Average number of warps that are eligible to issue per active cycle

               atomic_throughput:  Global memory atomic and reduction throughput
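
Once you have picked metric names from the list, you can pass them to nvprof with --metrics when profiling your program. A minimal sketch, assuming nvprof lives in /usr/local/cuda/bin as in the prompt above; ./my_cuda_app is a hypothetical placeholder for your own binary, and the NVPROF and APP variables are only there for illustration:

```shell
#!/bin/sh
# Hypothetical application path; replace with your own CUDA binary.
APP="${APP:-./my_cuda_app}"
# Location of nvprof, assumed from the prompt shown in the listing.
NVPROF="${NVPROF:-/usr/local/cuda/bin/nvprof}"
# Metric names copied from the --query-metrics listing.
METRICS="achieved_occupancy,sm_efficiency"

if [ -x "$NVPROF" ] && [ -x "$APP" ]; then
    # Collect the selected metrics for every kernel the application launches.
    "$NVPROF" --metrics "$METRICS" "$APP"
else
    # Either nvprof or the application is missing; just show the command.
    echo "Would run: $NVPROF --metrics $METRICS $APP"
fi
```

nvprof should then print the selected metrics per kernel. Collecting metrics can slow the run down noticeably because kernels may be replayed, so it is probably best to request only the few metrics you need.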