Hi everyone,
I’m looking at how the performance counters change when the amount of math and memory accesses in a kernel is varied. However, the dram_reads counter doesn’t match my expectations, and I’d like to understand why.
The source code looks like this:
__global__ void test_kernel(int *gin, int aPitch, int *gout, int kernelId){
    int offset = kernelId * aPitch;
    // linear index of this thread within the pitched 2D array
    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;
    int in0 = gin[offset + gIdx];        // one load per thread
    for (int i = 0; i < WORK; i++) {     // WORK iterations of integer math
        in0 = (in0 + i) * in0 + i;
    }
    gout[offset + gIdx] = in0;           // one store per thread
}
Following the GTC10 presentation on analysis-driven optimization by Paulius, I modified the kernel into mem-only and math-only versions as follows:
__global__ void test_kernel_memonly(int *gin, int aPitch, int *gout, int kernelId){
    int offset = kernelId * aPitch;
    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;
    int in0 = gin[offset + gIdx];
    gout[offset + gIdx] = in0;
}
__global__ void test_kernel_mathonly(int *gin, int aPitch, int *gout, int kernelId, int flag){ // launched with flag == 0
    int offset = kernelId * aPitch;
    int gIdx = aPitch * (blockDim.y * blockIdx.y + threadIdx.y) + blockDim.x * blockIdx.x + threadIdx.x;
    int in0 = 1;
    for (int i = 0; i < WORK; i++) {
        in0 = (in0 + i) * in0 + i;
    }
    if (1 == flag * in0) {               // never true at run time, but keeps the compiler from eliminating the math
        gout[offset + gIdx] = in0;
    }
}
Each kernel is launched with a single thread block of 256 threads and WORK=1000, on a Tesla C2050 with CUDA 4.0 RC and computeprof 3.2.
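For reference, the host side is roughly this (allocation size, pitch, and error checking are simplified here, so treat it as a sketch; WORK is #defined to 1000 above the kernels):

int main(void){
    const int N = 256;                          // one int per thread
    int *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in,  N * sizeof(int));
    cudaMalloc((void**)&d_out, N * sizeof(int));

    dim3 block(256, 1, 1);
    dim3 grid(1, 1, 1);                         // a single thread block
    int aPitch = N;                             // pitch in elements, kernelId = 0
    test_kernel<<<grid, block>>>(d_in, aPitch, d_out, 0);
    cudaThreadSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

With that setup, here’s what the profiler reports: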
full vs mem-only:
block size (256, 1, 1), occupancy 0.167 (both), registers 5 (full) → 4 (memonly),
gld_request = 8 (both), l1_global_load_miss = 8 (both), global_store_transaction = 8 (both),
dram_reads ~4294 (full) → ~2204 (memonly), dram_writes = 52 (both),
instructions issued ~40160 (full) → ~144 (memonly).
full vs math-only:
block size (256, 1, 1), occupancy 0.167 (both), registers 5 (both),
gld_request = 0 (both), l1_global_load_miss = 0 (both), global_store_transaction = 0 (both),
dram_reads ~4294 (full) → ~4552 (mathonly), dram_writes = 52 (full) → 13 (mathonly),
instructions issued ~40160 (full) → ~40064 (mathonly).
How is dram_reads related to the gld_request and l1_global_load_miss counters? Is there a formula to estimate dram_reads from them?
And why is there a 50% drop in dram_reads for the memonly kernel? For mathonly, on the other hand, the value is about the same as for the original, whereas I’d expect it to be 0 since the gld counters are 0. There’s also a large difference between the dram_reads and dram_writes values, even though each thread reads one data element and writes one data element. What could be the reason for this?
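In case it helps to see where my expectation comes from (this is my own back-of-envelope, so it may well be where I’m going wrong): 256 threads each load one 4-byte int, i.e. 1024 bytes, which is 8 L1 cache lines of 128 bytes; that matches gld_request = 8 and l1_global_load_miss = 8. If each of those misses went all the way to DRAM and dram_reads counted 32-byte accesses (I’m not sure about the counter’s granularity), I’d expect something on the order of a few tens of reads, not thousands, and a dram_writes value of roughly the same size for the stores.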
As a next step, I wanted to transform the original kernel into a bandwidth-bound kernel in order to run some further experiments.
I’ve been wondering: is there a simple trick to vary the character of a kernel from bandwidth-bound to instruction-bound with a single parameter (or two)? A sketch of what I mean is below.
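Here is the kind of thing I have in mind; iters and elemsPerThread are just made-up knobs for this sketch, not anything from my real code:

__global__ void tune_kernel(const int *gin, int *gout, int n, int iters, int elemsPerThread){
    // elemsPerThread controls memory traffic per thread, iters controls math per element:
    // small iters + large elemsPerThread should push towards bandwidth-bound,
    // large iters + elemsPerThread = 1 towards instruction-bound.
    int tid    = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int e = 0; e < elemsPerThread; e++) {
        int idx = tid + e * stride;            // stepping by the grid size keeps the loads coalesced
        if (idx < n) {
            int v = gin[idx];
            for (int i = 0; i < iters; i++)    // same math body as the original kernel
                v = (v + i) * v + i;
            gout[idx] = v;
        }
    }
}

But maybe there’s a cleaner or more standard way to dial a single kernel between the two regimes?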
Looking forward to ideas! :)