Hi everyone,
I’m trying to understand why this simple example has a high instruction replay overhead.
This is the kernel:
__global__ void daxpy(double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        y[i] += a * x[i];
    }
}
This is how the kernel is launched:
#define N 2048*14*1
...
int blocksize = 1024;
int gridsize = int((N + blocksize - 1) / blocksize);
...
for (int i = 0; i < 10000; i++)
{
    daxpy<<<gridsize, blocksize>>>(a, dx, dy);
    cudaThreadSynchronize();
}
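The rest of the host code is just the usual allocate-and-copy boilerplate, roughly like this (simplified sketch; hx and hy stand in for the initialized host arrays, and error checking is omitted):
size_t bytes = N * sizeof(double);
double a = 2.0;          // arbitrary scalar for this sketch
double *dx, *dy;

// device buffers for x and y
cudaMalloc((void**)&dx, bytes);
cudaMalloc((void**)&dy, bytes);

// copy the initialized host arrays (hx, hy are placeholders) to the device
cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);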
This is the output from nvprof:
nvprof --print-gpu-trace --metrics inst_replay_overhead,global_replay_overhead,local_replay_overhead ./a.out --benchmark
==21232== NVPROF is profiling process 21232, command: ./a.out --benchmark
==21232== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==21232== Profiling application: ./a.out --benchmark
==21232== Profiling result:
Device Context Stream Kernel Instruction Replay Overhead Global Memory Replay Overhead Local Memory Cache Replay Overhead
Tesla K20Xm (0) 1 7 daxpy(double, double 0.999343 0.176471 0.000000
Tesla K20Xm (0) 1 7 daxpy(double, double 0.881696 0.176471 0.000000
Tesla K20Xm (0) 1 7 daxpy(double, double 0.884585 0.176471 0.000000
Tesla K20Xm (0) 1 7 daxpy(double, double 0.878414 0.176471 0.000000
Tesla K20Xm (0) 1 7 daxpy(double, double 0.880909 0.176471 0.000000
Tesla K20Xm (0) 1 7 daxpy(double, double 0.874540 0.176471 0.000000
None of the usual causes I have read about seems to apply here; in particular, the global and local memory replay overheads reported above are far too small to account for an instruction replay overhead of nearly 0.9.
Any ideas?
When I change the kernel from doubles to floats, the instruction replay overhead drops to about 0.3.
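For that test I simply swapped double for float everywhere; the kernel becomes roughly the following (kernel name changed only for clarity):
__global__ void saxpy(float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        y[i] += a * x[i];
    }
}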
Additionally, does anyone know why the instruction replay overhead is consistently higher for the first kernel launch than for the later ones?
Thank you very much!