I am trying to do bandwidth test on xavier
I wrote 2 kernels
kernel 1: access a image by row priority, is a coalescing access,gst_efficiency = 100%
__global__ void KernelCpoy1DRow(int cols, int rows, uchar4 *mem1, uchar4 *mem2) {
int tid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) +
threadIdx.y * blockDim.x + threadIdx.x;
if (tid < cols * rows) {
mem2[tid].x = 255;
mem2[tid].y = 255;
mem2[tid].z = 255;
mem2[tid].w = 255;
}
}
profiling it by nvprof
10 gld_transactions Global Load Transactions 0 0 0
10 gst_transactions Global Store Transactions 145800 145800 145800
10 sysmem_read_transactions System Memory Read Transactions 7 39 10
10 sysmem_write_transactions System Memory Write Transactions 145813 145813 145813
10 l2_read_transactions L2 Read Transactions 214720 219808 216842
10 l2_write_transactions L2 Write Transactions 145813 145813 145813
10 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
10 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
10 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
10 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
10 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
10 l2_local_global_store_bytes Bytes written to L2 from L1 for local and global stores. 4665600 4665600 4665600
10 l2_surface_store_bytes Bytes read from L2 for misses in L1 for surface stores 0 0 0
10 sysmem_read_bytes System Memory Read Bytes 224 1248 326
10 sysmem_write_bytes System Memory Write Bytes 4666016 4666016 4666016
10 l2_tex_hit_rate L2 Cache Hit Rate 0.00% 0.00% 0.00%
10 gld_efficiency Global Memory Load Efficiency 0.00% 0.00% 0.00%
10 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
sysmem_write_bytes = 4666016
when i run this kernel 1000times/second(about 4.35G/s write), i use tegrastatus read EMC = 6%@2133:
RAM 3389/15700MB (lfb 600x4MB) CPU [2%@2265,5%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 6%@2133 GR3D_FREQ 6%@1377 APE 150 MTS fg 0% bg 0% AO@27.5C GPU@29.5C Tboard@28C Tdiode@29.25C AUX@27.5C CPU@29C thermal@28.35C PMIC@100C GPU 1705/1120 CPU 929/639 SOC 2324/2176 CV 0/0 VDDRQ 155/9 SYS5V 2140/2063
kernel 2:.another kernel access a image by column priority, is low efficiency gst_efficiency = 12.5% (image element uchar4)
__global__ void KernelCpoy1DCol(int cols, int rows, uchar4 *mem1, uchar4 *mem2) {
int tid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y) +
threadIdx.y * blockDim.x + threadIdx.x;
if (tid < cols * rows) {
int col = tid / rows;
int row = tid % rows;
int idx = row * cols + col;
mem2[idx].x = 255;
mem2[idx].y = 255;
mem2[idx].z = 255;
mem2[idx].w = 255;
}
}
profiling it by nvprof
10 gld_transactions Global Load Transactions 0 0 0
10 gst_transactions Global Store Transactions 1166400 1166400 1166400
10 sysmem_read_transactions System Memory Read Transactions 7 63 12
10 sysmem_write_transactions System Memory Write Transactions 1164916 1165276 1165105
10 l2_read_transactions L2 Read Transactions 583064 594352 586775
10 l2_write_transactions L2 Write Transactions 1165012 1165266 1165097
10 global_hit_rate Global Hit Rate in unified l1/tex 49.96% 50.44% 50.22%
10 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
10 tex_cache_hit_rate Unified Cache Hit Rate 0.10% 0.13% 0.12%
10 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
10 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 88.60% 88.99% 88.77%
10 l2_global_atomic_store_bytes Bytes written to L2 from L1 for global atomics 0 0 0
10 l2_local_global_store_bytes Bytes written to L2 from L1 for local and global stores. 37273184 37288960 37281254
10 l2_surface_store_bytes Bytes read from L2 for misses in L1 for surface stores 0 0 0
10 sysmem_read_bytes System Memory Read Bytes 224 2016 403
10 sysmem_write_bytes System Memory Write Bytes 37277312 37288832 37283369
10 l2_tex_hit_rate L2 Cache Hit Rate 88.60% 88.99% 88.77%
10 gld_efficiency Global Memory Load Efficiency 0.00% 0.00% 0.00%
10 gst_efficiency Global Memory Store Efficiency 12.50% 12.50% 12.50%
System Memory Write Bytes = 37283369
when i run this kernel 1000times/second(about 34.7G/s write), i use tegrastatus, EMC = 14%@2133:
RAM 3389/15700MB (lfb 600x4MB) CPU [2%@2265,5%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265,0%@2265] EMC_FREQ 14%@2133 GR3D_FREQ 60%@1377 APE 150 MTS fg 0% bg 0% AO@29C GPU@32.5C Tboard@29C Tdiode@30.5C AUX@28C CPU@30C thermal@29.95C PMIC@100C GPU 6027/6011 CPU 772/772 SOC 2936/2936 CV 0/0 VDDRQ 1082/1082 SYS5V 2503/2503
my question is:
-
In kernel 2 nvprof result, system Memory Write Transactions = gst_transactions,it seems that L2 cache does not working, dose system Memory Write means dram write?
-
In kernel 2 nvprof result, system Memory Write Transactions is about 8 time that in kernel 1
But in tegrastatus EMC value,kernel 2 vs kernel 1 = 14% vs 6%, does not match 8 times.
what is the relationship between nvprof result and EMC。is there a way to calculate EMC from profling result?
thanks