Memory profiling by nvprof does not match the EMC value read by tegrastats

I am trying to do a bandwidth test on Xavier. I wrote two kernels.

kernel 1: accesses an image in row-major order (a coalesced access), gst_efficiency = 100%

__global__ void KernelCpoy1DRow(int cols, int rows, uchar4 *mem1, uchar4 *mem2) {
  // Linear thread index: consecutive threads in a warp get consecutive tids.
  int tid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) +
            threadIdx.y * blockDim.x + threadIdx.x;
  if (tid < cols * rows) {
    // Consecutive threads store to consecutive uchar4 elements -> coalesced.
    mem2[tid].x = 255;
    mem2[tid].y = 255;
    mem2[tid].z = 255;
    mem2[tid].w = 255;
  }
}

Profiling it with nvprof:

Invocations                               Metric Name                               Metric Description         Min         Max         Avg
10                          gld_transactions                                         Global Load Transactions           0           0           0
10                          gst_transactions                                        Global Store Transactions      145800      145800      145800
10                  sysmem_read_transactions                                  System Memory Read Transactions           7          39          10
10                 sysmem_write_transactions                                 System Memory Write Transactions      145813      145813      145813
10                      l2_read_transactions                                             L2 Read Transactions      214720      219808      216842
10                     l2_write_transactions                                            L2 Write Transactions      145813      145813      145813
10                           global_hit_rate                                Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
10                            local_hit_rate                                                   Local Hit Rate       0.00%       0.00%       0.00%
10                        tex_cache_hit_rate                                           Unified Cache Hit Rate       0.00%       0.00%       0.00%
10                      l2_tex_read_hit_rate                                      L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
10                     l2_tex_write_hit_rate                                     L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
10               l2_local_global_store_bytes         Bytes written to L2 from L1 for local and global stores.     4665600     4665600     4665600
10                    l2_surface_store_bytes           Bytes read from L2 for misses in L1 for surface stores           0           0           0
10                         sysmem_read_bytes                                         System Memory Read Bytes         224        1248         326
10                        sysmem_write_bytes                                        System Memory Write Bytes     4666016     4666016     4666016
10                           l2_tex_hit_rate                                                L2 Cache Hit Rate       0.00%       0.00%       0.00%
10                            gld_efficiency                                    Global Memory Load Efficiency       0.00%       0.00%       0.00%
10                            gst_efficiency                                   Global Memory Store Efficiency     100.00%     100.00%     100.00%

sysmem_write_bytes = 4666016
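As a cross-check of these counters (my own arithmetic, not from the nvprof docs): each global-store transaction on this GPU moves one 32-byte sector, and sizeof(uchar4) is 4 bytes. That makes the numbers self-consistent and implies the image has 1,166,400 elements — for example 1080 x 1080, though the actual dimensions are my assumption since they are not shown in the post:

```cpp
#include <cstdint>

// Assumptions: 32-byte L2 sectors per store transaction, 4-byte uchar4 elements.
constexpr std::int64_t kSectorBytes = 32;
constexpr std::int64_t kUchar4Bytes = 4;

// Bytes the kernel asks to store for an n-element uchar4 image.
constexpr std::int64_t store_bytes(std::int64_t n) { return n * kUchar4Bytes; }

// Fully coalesced stores: every 32-byte sector carries 8 uchar4 elements,
// so the sector count is simply total bytes / 32.
constexpr std::int64_t coalesced_sectors(std::int64_t n) {
  return store_bytes(n) / kSectorBytes;
}
```

With n = 1,166,400 this gives store_bytes = 4,665,600 (matching l2_local_global_store_bytes, and sysmem_write_bytes up to a small overhead) and coalesced_sectors = 145,800 (matching gst_transactions).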

When I run this kernel 1000 times per second (about 4.35 GB/s of writes), tegrastats reports EMC = 6%@2133:

RAM 3389/15700MB (lfb 600x4MB) CPU [2%@2265,5%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 6%@2133 GR3D_FREQ 6%@1377 APE 150 MTS fg 0% bg 0% AO@27.5C GPU@29.5C Tboard@28C Tdiode@29.25C AUX@27.5C CPU@29C thermal@28.35C PMIC@100C GPU 1705/1120 CPU 929/639 SOC 2324/2176 CV 0/0 VDDRQ 155/9 SYS5V 2140/2063

kernel 2: another kernel that accesses the image in column-major order; this is a low-efficiency access, gst_efficiency = 12.5% (image element is uchar4)

__global__ void KernelCpoy1DCol(int cols, int rows, uchar4 *mem1, uchar4 *mem2) {
  int tid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) +
            threadIdx.y * blockDim.x + threadIdx.x;

  if (tid < cols * rows) {
    // Walk the image column by column: consecutive threads land in different
    // rows, so their 4-byte stores are scattered across memory (uncoalesced).
    int col = tid / rows;
    int row = tid % rows;

    int idx = row * cols + col;
    mem2[idx].x = 255;
    mem2[idx].y = 255;
    mem2[idx].z = 255;
    mem2[idx].w = 255;
  }
}

Profiling it with nvprof:

Invocations                               Metric Name                               Metric Description         Min         Max         Avg
10                          gld_transactions                                         Global Load Transactions           0           0           0
10                          gst_transactions                                        Global Store Transactions     1166400     1166400     1166400
10                  sysmem_read_transactions                                  System Memory Read Transactions           7          63          12
10                 sysmem_write_transactions                                 System Memory Write Transactions     1164916     1165276     1165105
10                      l2_read_transactions                                             L2 Read Transactions      583064      594352      586775
10                     l2_write_transactions                                            L2 Write Transactions     1165012     1165266     1165097
10                           global_hit_rate                                Global Hit Rate in unified l1/tex      49.96%      50.44%      50.22%
10                            local_hit_rate                                                   Local Hit Rate       0.00%       0.00%       0.00%
10                        tex_cache_hit_rate                                           Unified Cache Hit Rate       0.10%       0.13%       0.12%
10                      l2_tex_read_hit_rate                                      L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
10                     l2_tex_write_hit_rate                                     L2 Hit Rate (Texture Writes)      88.60%      88.99%      88.77%
10              l2_global_atomic_store_bytes                   Bytes written to L2 from L1 for global atomics           0           0           0
10               l2_local_global_store_bytes         Bytes written to L2 from L1 for local and global stores.    37273184    37288960    37281254
10                    l2_surface_store_bytes           Bytes read from L2 for misses in L1 for surface stores           0           0           0
10                         sysmem_read_bytes                                         System Memory Read Bytes         224        2016         403
10                        sysmem_write_bytes                                        System Memory Write Bytes    37277312    37288832    37283369
10                           l2_tex_hit_rate                                                L2 Cache Hit Rate      88.60%      88.99%      88.77%
10                            gld_efficiency                                    Global Memory Load Efficiency       0.00%       0.00%       0.00%
10                            gst_efficiency                                   Global Memory Store Efficiency      12.50%      12.50%      12.50%

System Memory Write Bytes = 37283369
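The kernel 2 counters are also self-consistent (again my own arithmetic, assuming 32-byte sectors and the 1,166,400-element image inferred from the kernel 1 counters): because consecutive threads write to different rows, each 4-byte uchar4 store dirties its own 32-byte sector, so the transaction count equals the element count and only 4 of every 32 bytes moved are useful:

```cpp
#include <cstdint>

constexpr std::int64_t kSectorBytes = 32;
constexpr std::int64_t kUchar4Bytes = 4;

// Scattered stores: one 32-byte sector per element.
constexpr std::int64_t scattered_sectors(std::int64_t n) { return n; }

// Bytes actually moved vs. bytes the kernel asked to write.
constexpr std::int64_t moved_bytes(std::int64_t n) { return n * kSectorBytes; }
constexpr double gst_efficiency_pct(std::int64_t n) {
  return 100.0 * (n * kUchar4Bytes) / static_cast<double>(moved_bytes(n));
}
```

With n = 1,166,400 this gives scattered_sectors = 1,166,400 (matching gst_transactions, and 8x kernel 1's 145,800), moved_bytes = 37,324,800 (close to the measured sysmem_write_bytes of 37,283,369), and gst_efficiency_pct = 12.5, matching the reported efficiency.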

When I run this kernel 1000 times per second (about 34.7 GB/s of writes), tegrastats reports EMC = 14%@2133:

RAM 3389/15700MB (lfb 600x4MB) CPU [2%@2265,5%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 14%@2133 GR3D_FREQ 60%@1377 APE 150 MTS fg 0% bg 0% AO@29C GPU@32.5C Tboard@29C Tdiode@30.5C AUX@28C CPU@30C thermal@29.95C PMIC@100C GPU 6027/6011 CPU 772/772 SOC 2936/2936 CV 0/0 VDDRQ 1082/1082 SYS5V 2503/2503

My questions are:

  1. In the kernel 2 nvprof result, System Memory Write Transactions is almost equal to gst_transactions; it seems the L2 cache is not working. Does "system memory write" mean a DRAM write?

  2. In the kernel 2 nvprof result, System Memory Write Transactions is about 8 times that of kernel 1, but the tegrastats EMC values for kernel 2 vs kernel 1 are 14% vs 6%, which does not match an 8x ratio.

What is the relationship between the nvprof results and EMC? Is there a way to calculate EMC from the profiling results?

thanks

Hi,

1. This depends on the kind of memory that is allocated.
On Jetson, pinned memory is uncached, while unified memory is cached.

Here is our document introducing memory on Jetson:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html

2. EMC bandwidth includes all sysmem/carve-out/GART memory accesses,
so you need to take all of that memory traffic into consideration.
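One rough sanity check you can do (my own back-of-envelope, not an official formula): divide the byte rate derived from nvprof by an assumed theoretical peak of about 137 GB/s for Xavier (256-bit LPDDR4x at 2133 MHz, i.e. 4266 MT/s x 32 bytes). Note that the tegrastats EMC% counts active memory-controller cycles rather than bytes, and it also includes CPU/display/SoC traffic, so the result will not line up exactly with what tegrastats prints:

```cpp
// Assumed Xavier theoretical peak DRAM bandwidth in GB/s (my assumption).
constexpr double kPeakGBps = 137.0;

// Lower-bound estimate of EMC utilization if EMC% were purely bytes / peak.
constexpr double emc_percent_if_bytes_only(double gbps) {
  return 100.0 * gbps / kPeakGBps;
}
```

For kernel 1 (~4.7 GB/s) this predicts roughly 3-4%, below the reported 6% because other SoC traffic rides on top; for kernel 2 (~37 GB/s) it predicts roughly 27%, well above the reported 14%, which shows that EMC% cannot be computed directly from the nvprof byte counters.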

Here is our document on tegrastats for your reference:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2FAppendixTegraStats.html%23wwpID0E0SB0HA
