Measuring DRAM throughput

Is it possible to measure DRAM bandwidth and how much data has actually been physically read/written by the GPU? Or even (more generally speaking) by the whole system (CPU+GPU)?

I am using a Jetson-TK1, so in my context, I am talking about main memory which is shared between CPU and GPU.

There are no performance counters (nvprof / CUPTI) that will provide me with this information.

Have you looked at the DRAM metrics available with nvprof?

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

dram_read_transactions and dram_write_transactions should give you the total data read and written by a kernel (each transaction covers 32 bytes). Dividing the resulting byte count by the kernel execution time (also reported by nvprof --print-gpu-trace …) gives you throughput numbers for a given kernel.
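
The arithmetic is trivial. As a minimal sketch (placeholder numbers, and assuming the 32 bytes per DRAM transaction that the metrics reference documents):

#include <stdio.h>

/* Minimal sketch: turn nvprof DRAM transaction counts into a throughput
 * figure. Assumes 32 bytes per dram_* transaction, per the nvprof
 * metrics reference. All numbers below are placeholders, not
 * measurements. */
int main(void)
{
    double read_tx  = 3276800.0;   /* dram_read_transactions */
    double write_tx = 3276800.0;   /* dram_write_transactions */
    double kernel_s = 0.0015;      /* kernel time from --print-gpu-trace */

    double bytes = (read_tx + write_tx) * 32.0;
    printf("DRAM throughput: %.2f GB/s\n", bytes / kernel_s / 1e9);
    return 0;
}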

@txbob: Yeah, I struggled a lot with nvprof at first, as per this thread: https://devtalk.nvidia.com/default/topic/864230/embedded-systems/-jetson-tk1-how-to-measure-dram-lt-gt-l2-r-w-bandwidth-on-tegra-k1-/

The problem is, the counters you have mentioned do not exist on that GPU implementation.

I am currently trying to read DRAM usage through the ARM performance counters (from the CPU side), but I am unsure whether this will work at all. I suspect that even if these counters are implemented, they only cover CPU accesses, not GPU accesses.

In /sys I found a Tegra activity monitor with some sort of “average activity” file that relates to bandwidth usage, but it does not distinguish between reads and writes, and I don’t know the unit of the number (I will check the kernel sources).

It would be nice to get some attention from someone at NVIDIA on this ;) Getting actual DRAM statistics (R/W bandwidth, cumulative bytes written, etc.) seems like such a mundane thing. Why is there nothing like this?

What is the output from

nvprof --query-metrics

on the TK1?

Nothing which can be used for this purpose:

xxxxxx@tegra-ubuntu:~$ nvprof --query-metrics --query-events
Available Events:
                            Name   Description
Device 0 (GK20A):
	Domain domain_a:
       tex0_cache_sector_queries:  Number of texture cache 0 requests. This increments by 1 for each 32-byte access.

       tex1_cache_sector_queries:  Number of texture cache 1 requests. This increments by 1 for each 32-byte access.

        tex0_cache_sector_misses:  Number of texture cache 0 misses. This increments by 1 for each 32-byte access.

        tex1_cache_sector_misses:  Number of texture cache 1 misses. This increments by 1 for each 32-byte access.

rocache_subp0_gld_warp_count_128b:  Number of 128-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp0_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp0_gld_warp_count_64b:  Number of 64-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_128b:  Number of 128-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_64b:  Number of 64-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp0_gld_thread_count_128b:  Number of 128-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp0_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp0_gld_thread_count_64b:  Number of 64-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_128b:  Number of 128-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_64b:  Number of 64-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

               elapsed_cycles_sm:  Elapsed clocks

	Domain domain_b:
l2_subp0_read_tex_sector_queries:  Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_sysmem_sector_queries:  Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_write_sysmem_sector_queries:  Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_total_read_sector_queries:  Total read requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp0_total_write_sector_queries:  Total write requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

	Domain domain_c:
                   gld_inst_8bit:  Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_16bit:  Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_32bit:  Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_64bit:  Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.

                 gld_inst_128bit:  Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.

                   gst_inst_8bit:  Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_16bit:  Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_32bit:  Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_64bit:  Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.

                 gst_inst_128bit:  Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.

           rocache_gld_inst_8bit:  Total number of 8-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

          rocache_gld_inst_16bit:  Total number of 16-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

          rocache_gld_inst_32bit:  Total number of 32-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

          rocache_gld_inst_64bit:  Total number of 64-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

         rocache_gld_inst_128bit:  Total number of 128-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

	Domain domain_d:
                 prof_trigger_00:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_01:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_02:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_03:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_04:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_05:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_06:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_07:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                  warps_launched:  Number of warps launched on a multiprocessor.

                threads_launched:  Number of threads launched on a multiprocessor.

                    inst_issued1:  Number of single instruction issued per cycle

                    inst_issued2:  Number of dual instructions issued per cycle

                   inst_executed:  Number of instructions executed, do not include replays.

            thread_inst_executed:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction.

not_predicated_off_thread_inst_executed:  Number of not predicated off instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads that execute this instruction.

                      atom_count:  Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.

                  atom_cas_count:  Number of warps executing atomic compare and swap operations. Increments by one if at least one thread in a warp executes the instruction.

                      gred_count:  Number of warps executing reduction operations on global memory. Increments by one if at least one thread in a warp executes the instruction

                     shared_load:  Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.

                    shared_store:  Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.

                      local_load:  Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.

                     local_store:  Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.

                     gld_request:  Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.

                     gst_request:  Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.

                   active_cycles:  Number of cycles a multiprocessor has at least one active warp.

                    active_warps:  Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.

                 sm_cta_launched:  Number of thread blocks launched on a multiprocessor.

         local_load_transactions:  Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

        local_store_transactions:  Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

     l1_shared_load_transactions:  Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

    l1_shared_store_transactions:  Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

   __l1_global_load_transactions:  Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

  __l1_global_store_transactions:  Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

               l1_local_load_hit:  Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_load_miss:  Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_store_hit:  Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_local_store_miss:  Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively.

              l1_global_load_hit:  Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_global_load_miss:  Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

uncached_global_load_transaction:  Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

        global_store_transaction:  Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

global_ld_mem_divergence_replays:  Number of instruction replays for global memory loads. Instruction is replayed if the instruction is accessing more than one cache line of 128 bytes. For each extra cache line access the counter is incremented by 1.

global_st_mem_divergence_replays:  Number of instruction replays for global memory stores. Instruction is replayed if the instruction is accessing more than one cache line of 128 bytes. For each extra cache line access the counter is incremented by 1.

              shared_load_replay:  Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes).

             shared_store_replay:  Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle.

Available Metrics:
                            Name   Description
Device 0 (GK20A):
        l1_cache_global_hit_rate:  Hit rate in L1 cache for global loads

         l1_cache_local_hit_rate:  Hit rate in L1 cache for local loads and stores

                   sm_efficiency:  The percentage of time at least one warp is active on a multiprocessor averaged over all multiprocessors on the GPU

                             ipc:  Instructions executed per cycle

              achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

        gld_requested_throughput:  Requested global memory load throughput

        gst_requested_throughput:  Requested global memory store throughput

          sm_efficiency_instance:  The percentage of time at least one warp is active on a multiprocessor

                    ipc_instance:  Instructions executed per cycle for a single multiprocessor

            inst_replay_overhead:  Average number of replays for each instruction executed

          shared_replay_overhead:  Average number of replays due to shared memory conflicts for each instruction executed

          global_replay_overhead:  Average number of replays due to global memory cache misses for each instruction executed

    global_cache_replay_overhead:  Average number of replays due to global memory cache misses for each instruction executed

              tex_cache_hit_rate:  Texture cache hit rate

            tex_cache_throughput:  Texture cache throughput

                  gst_throughput:  Global memory store throughput

                  gld_throughput:  Global memory load throughput

           local_replay_overhead:  Average number of replays due to local memory accesses for each instruction executed

               shared_efficiency:  Ratio of requested shared memory throughput to required shared memory throughput

                  gld_efficiency:  Ratio of requested global memory load throughput to required global memory load throughput. Values greater than 100% indicate that, on average, the load requests of multiple threads in a warp fetched from the same memory address

                  gst_efficiency:  Ratio of requested global memory store throughput to required global memory store throughput. Values greater than 100% indicate that, on average, the store requests of multiple threads in a warp targeted the same memory address

       warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

     nc_gld_requested_throughput:  Requested throughput for global memory loaded via non-coherent cache

                      issued_ipc:  Instructions issued per cycle

                   inst_per_warp:  Average number of instructions executed by each warp

          issue_slot_utilization:  Percentage of issue slots that issued at least one instruction, averaged across all cycles

local_load_transactions_per_request:  Average number of local memory load transactions performed for each local memory load

local_store_transactions_per_request:  Average number of local memory store transactions performed for each local memory store

shared_load_transactions_per_request:  Average number of shared memory load transactions performed for each shared memory load

shared_store_transactions_per_request:  Average number of shared memory store transactions performed for each shared memory store

    gld_transactions_per_request:  Average number of global memory load transactions performed for each global memory load

    gst_transactions_per_request:  Average number of global memory store transactions performed for each global memory store

         local_load_transactions:  Number of local memory load transactions

        local_store_transactions:  Number of local memory store transactions

        shared_load_transactions:  Number of shared memory load transactions

       shared_store_transactions:  Number of shared memory store transactions

                gld_transactions:  Number of global memory load transactions

                gst_transactions:  Number of global memory store transactions

          tex_cache_transactions:  Texture cache read transactions

           local_load_throughput:  Local memory load throughput

          local_store_throughput:  Local memory store throughput

          shared_load_throughput:  Shared memory load throughput

         shared_store_throughput:  Shared memory store throughput

warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor

                       cf_issued:  Number of issued control-flow instructions

                     cf_executed:  Number of executed control-flow instructions

                     ldst_issued:  Number of issued local, global, shared and texture memory load and store instructions

                   ldst_executed:  Number of executed local, global, shared and texture memory load and store instructions

                   flop_count_sp:  Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)

               flop_count_sp_add:  Number of single-precision floating-point add operations executed by non-predicated threads

               flop_count_sp_mul:  Number of single-precision floating-point multiply operations executed by non-predicated threads

               flop_count_sp_fma:  Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads

                   flop_count_dp:  Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)

               flop_count_dp_add:  Number of double-precision floating-point add operations executed by non-predicated threads

               flop_count_dp_mul:  Number of double-precision floating-point multiply operations executed by non-predicated threads

               flop_count_dp_fma:  Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads

           flop_count_sp_special:  Number of single-precision floating-point special operations executed by non-predicated threads

                stall_inst_fetch:  Percentage of stalls occurring because the next assembly instruction has not yet been fetched

           stall_exec_dependency:  Percentage of stalls occurring because an input required by the instruction is not yet available

         stall_memory_dependency:  Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding

                   stall_texture:  Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests

                      stall_sync:  Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

                     stall_other:  Percentage of stalls occurring due to miscellaneous reasons

                 tex_utilization:  The utilization level of the texture cache relative to the peak utilization

             ldst_fu_utilization:  The utilization level of the multiprocessor function units that execute global, local and shared memory instructions

              alu_fu_utilization:  The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions

               cf_fu_utilization:  The utilization level of the multiprocessor function units that execute control-flow instructions

              tex_fu_utilization:  The utilization level of the multiprocessor function units that execute texture instructions

                   inst_executed:  The number of instructions executed

                     inst_issued:  The number of instructions issued

                     issue_slots:  The number of issue slots used

           nc_l2_read_throughput:  Memory read throughput for non-coherent global read requests seen at L2 cache

         nc_l2_read_transactions:  Memory read transactions seen at L2 cache for non-coherent global read requests

        nc_cache_global_hit_rate:  Hit rate in non-coherent cache for global loads

               nc_gld_throughput:  Non-coherent global memory load throughput

               nc_gld_efficiency:  Ratio of requested non-coherent global memory load throughput to required non-coherent global memory load throughput

                      inst_fp_32:  Number of single-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)

                      inst_fp_64:  Number of double-precision floating-point instructions executed by non-predicated threads (arithmetric, compare, etc.)

                    inst_integer:  Number of integer instructions executed by non-predicated threads

                inst_bit_convert:  Number of bit-conversion instructions executed by non-predicated threads

                    inst_control:  Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)

              inst_compute_ld_st:  Number of compute load/store instructions executed by non-predicated threads

                       inst_misc:  Number of miscellaneous instructions executed by non-predicated threads

 inst_inter_thread_communication:  Number of inter-thread communication instructions executed by non-predicated threads

          atomic_replay_overhead:  Average number of replays due to atomic and reduction bank conflicts for each instruction executed

             atomic_transactions:  Global memory atomic and reduction transactions

 atomic_transactions_per_request:  Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction

            l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests

           l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests

      l2_texture_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache

              l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests

             l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests

            l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests

                  l2_utilization:  The utilization level of the L2 cache relative to the peak utilization

                dram_utilization:  The utilization level of the device memory relative to the peak utilization

        l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache

          l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests

              flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations

              flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations

                 stall_pipe_busy:  Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

stall_constant_memory_dependency:  Percentage of stalls occurring because of immediate constant cache miss

           stall_memory_throttle:  Percentage of stalls occurring because of memory throttle

              stall_not_selected:  Percentage of stalls occurring because warp was not selected

        eligible_warps_per_cycle:  Average number of warps that are eligible to issue per active cycle

               atomic_throughput:  Global memory atomic and reduction throughput

Wouldn’t these metrics from the list work for you?

gst_throughput:  Global memory store throughput
gld_throughput:  Global memory load throughput

Those metrics refer to the logical memory space “global”, which is not the same as actual traffic to DRAM. Global memory load/store throughput can be very high even while DRAM traffic is very low, if there are lots of cache hits.
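
As an illustration (a sketch, hypothetical kernel, not code from this thread): run something like the following on an array small enough to stay resident in L2, and gld_throughput will be high while DRAM sees almost no traffic:

// Sketch: high logical "global" load throughput, low DRAM traffic.
// A small array that fits in L2 is read over and over; every load
// counts toward gld_throughput, but after the first pass most loads
// are served from cache and never reach DRAM.
__global__ void reread(const float *in, float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < iters; ++k)
        acc += in[(i + k) % n];   // revisits cache-resident lines
    out[i] = acc;
}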

If you simply want to measure peak DRAM bandwidth, bandwidthTest may be useful. dram_utilization might then give you a rough measure of the percentage of available bandwidth used by a kernel, but that is only a scale from 1 to 10, I believe, so it is nowhere near the granularity of e.g. dram_read_transactions, a metric that is available on cc 2.0, 3.0, 3.5, etc.

There might be some other clever combination of the given metrics, starting with e.g. L2 transactions and then using the cache hit ratio to arrive at a more granular percentage, but nothing occurs to me at the moment.

Thanks for clarifying the GLD/GST throughput metrics; I was not aware that they refer to pre-cache traffic. I am guessing that on a system with physically unified memory it is difficult for the GPU to report actual DRAM usage, which is why no such metric is offered.

Well, this would depend on the ability of the hardware (the memory controller) to report performance statistics. The Tegra K1 TRM states in chapter 16.3.5, “Memory controller → statistics and debugging”, that the hardware does have performance counters, without stating which ones. I will have a look through it today and see what I can find.

I have been through the entire /sys and /proc interface on my Jetson-TK1. The only relevant thing I found was

/sys/kernel/debug/tegra_actmon/emc/avg_activity

That file reports DRAM activity levels (reads and writes combined), but neither the amount of data transferred nor the throughput. I can see it increasing when running bandwidth-intensive GPU kernels.
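
Reading it is simple enough; a sketch like the following is all it takes (what the number actually means is exactly the open question):

#include <stdio.h>
#include <unistd.h>

/* Sketch: poll the Tegra actmon average-activity file once per second.
 * The unit of the reported value is unknown (kernel sources to be
 * checked). */
int main(void)
{
    const char *path = "/sys/kernel/debug/tegra_actmon/emc/avg_activity";
    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        long val;
        if (fscanf(f, "%ld", &val) == 1)
            printf("emc avg_activity: %ld\n", val);
        fclose(f);
        sleep(1);
    }
}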

I need to know the exact RAM usage of an arbitrary program. bandwidthTest (and every benchmark/performance tool I have found online) just shows the maximum attainable bandwidth, not what is actually being read or written.

The following performance counters are available:

List of compatible hardware events:
	CYCLES
	INSTRUCTIONS
	CACHE-REFERENCES
	CACHE-MISSES
	BRANCHES
	BRANCH-MISSES
	BUS-CYCLES
	REF-CYCLES

List of compatible hardware cache events:
	L1D_READ_ACCESS
	L1D_READ_MISS
	L1D_WRITE_ACCESS
	L1D_WRITE_MISS
	L1I_READ_ACCESS
	L1I_READ_MISS
	LL_READ_ACCESS
	LL_READ_MISS
	LL_WRITE_ACCESS
	LL_WRITE_MISS
	DTLB_READ_MISS
	DTLB_WRITE_MISS
	ITLB_READ_MISS
	ITLB_WRITE_MISS
	BPU_READ_ACCESS
	BPU_READ_MISS
	BPU_WRITE_ACCESS
	BPU_WRITE_MISS

That is, there are no memory performance counters for the EMC (external memory controller).
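
For reference, these map onto the standard Linux perf_event interface. A minimal perf_event_open sketch for reading one of them (LL read misses) would look roughly like the code below; it also illustrates the core limitation, namely that the ARM PMU only ever sees CPU-side traffic:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Sketch: count last-level-cache read misses over a region of code via
 * perf_event_open. This only ever sees CPU-side traffic; GPU accesses
 * bypass the ARM PMU entirely. */
static long perf_open(struct perf_event_attr *attr)
{
    return syscall(__NR_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type   = PERF_TYPE_HW_CACHE;
    attr.size   = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ     << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

    long fd = perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* ... workload to be measured goes here ... */

    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("LL read misses: %lld\n", count);
    close(fd);
    return 0;
}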

NVIDIA has not published a tool for collecting the memory controller counters on TK1. I recommend you file an RFE (request for enhancement) through the registered developers program.

I managed to track DRAM activity on the Tegra K1, by implementing it myself :) Note that this is not an indication of how much was actually read or written, but of how many busy cycles have occurred on the Tegra K1’s Memory Controller (MC).

Allow me to fill you in…

The Tegra K1 has several Activity Monitors (ACMs), which are described in the Tegra K1 TRM. I will go into some detail about them now, hoping I do not cross some sort of line by discussing the contents of the TRM. The ACMs are meant for tracking hardware activity; they feed the Dynamic Voltage and Frequency Scaling (DVFS) logic with the input it needs to select an appropriate frequency level, for example for the External Memory Controller (EMC).

There are two ACMs which are of interest:

  • MC BUSY CYCLES (CPU ONLY) ACM -> Counts busy CPU MC cycles.
  • MC BUSY CYCLES (GLOBAL) ACM -> Counts all busy MC cycles (regardless of source, including the GPU).

In other words, one can estimate busy GPU MC cycles by subtracting the former from the latter.
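
Expressed as code (the argument names are hypothetical placeholders for however the modified driver exposes the two counters):

/* Sketch of the subtraction above. The two arguments stand for
 * hypothetical readouts of the GLOBAL and CPU-ONLY busy-cycle ACMs;
 * the difference approximates busy MC cycles caused by non-CPU
 * clients (mainly the GPU). */
unsigned long long gpu_mc_busy_cycles(unsigned long long mc_busy_global,
                                      unsigned long long mc_busy_cpu_only)
{
    return mc_busy_global - mc_busy_cpu_only;
}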

I rewrote the Tegra K1 ACM driver to count the total number of busy cycles. Then I ran some simple CUDA code that reads/writes DRAM (the L2/L1 caches are bypassed entirely by using the ld.global.cv instruction). In short, I get the following:

  • GPU memcpy H->D: 100 MB total, 13569024 busy GPU MC cycles = 7.3 B/c
  • GPU memcpy D->H: 100 MB total, 12412928 busy GPU MC cycles = 8.0 B/c
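
For completeness, the cache-bypassing load can be done with inline PTX, roughly like this sketch (not my exact kernel):

// Sketch: a cache-bypassing read via inline PTX. ld.global.cv forces
// every access to fetch from memory again, so the counted MC busy
// cycles reflect real DRAM traffic instead of cache hits. The "r"
// constraint assumes a 32-bit pointer (as in the TK1's 32-bit ARM
// userspace); use "l" on 64-bit platforms.
__device__ float load_bypass(const float *p)
{
    float v;
    asm volatile("ld.global.cv.f32 %0, [%1];" : "=f"(v) : "r"(p));
    return v;
}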

So I present to you: a way to estimate DRAM bandwidth on the Tegra K1 :) Enjoy!

Thanks for your report! It’s really helpful.