Unexplained high instruction replay

Hi everyone,

I’m trying to understand why this simple example has a high instruction replay overhead.

This is the kernel:

__global__ void daxpy(double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        y[i] += a * x[i];
    }
}

This is how the kernel is launched:

#define N 2048*14*1
...

int blocksize = 1024;
int gridsize  = (N + blocksize - 1) / blocksize;
...
for (int i = 0; i < 10000; i++)
{
    daxpy<<<gridsize, blocksize>>>(a, dx, dy);
    cudaThreadSynchronize();
}

This is the output from nvprof:

nvprof --print-gpu-trace --metrics inst_replay_overhead,global_replay_overhead,local_replay_overhead  ./a.out --benchmark 
==21232== NVPROF is profiling process 21232, command: ./a.out --benchmark
==21232== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==21232== Profiling application: ./a.out --benchmark
==21232== Profiling result:
         Device          Context           Stream                Kernel  Instruction Replay Overhead  Global Memory Replay Overhead  Local Memory Cache Replay Overhead

Tesla K20Xm (0)                1                7  daxpy(double, double                     0.999343                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.881696                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.884585                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.878414                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.880909                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.874540                       0.176471                            0.000000

None of the possible causes I have read about seem to make sense for this example.
Any ideas?

When I change from doubles to floats, the instruction replay overhead drops to about 0.3.

Additionally, does anyone know why instruction replay is systematically higher for the first kernel launched?

Thank you very much!

Cache misses?

I don’t know, I haven’t looked into replay issues, but I did see something about __restrict__, perhaps a new feature in CUDA 7, or perhaps it existed earlier. It tells the compiler that pointer aliasing won’t happen, but you must make sure that it indeed doesn’t happen.
So if you are sure y[i] and x[i] don’t overlap, then __restrict__ can give higher performance by preventing replays. Newer cards might also have special caches which can be used if __restrict__ is specified.
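
Something like this is what I mean (just a sketch, not tested on my side; it assumes x and y really never overlap, and uses N as #defined in the original code):

__global__ void daxpy_restrict(double a,
                               const double * __restrict__ x,
                               double       * __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  // N as #defined earlier
    {
        // __restrict__ promises the compiler that x and y never alias;
        // const + __restrict__ also lets newer cards use the read-only data path.
        y[i] += a * x[i];
    }
}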

Hi tera,

Thanks for your input.
I disabled the cache with “-Xptxas -dlcm=cs”

But I still get high instruction replay:

==302== NVPROF is profiling process 302, command: ./a.out --benchmark
==302== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==302== Profiling application: ./a.out --benchmark
==302== Profiling result:
         Device          Context           Stream                Kernel  Instruction Replay Overhead  Global Memory Replay Overhead  Local Memory Cache Replay Overhead

Tesla K20Xm (0)                1                7  daxpy(double, double                     0.858423                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.760143                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.759979                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.750853                       0.176471                            0.000000
Tesla K20Xm (0)                1                7  daxpy(double, double                     0.783646                       0.176471                            0.000000

@Skybuck: Thanks for your answer. However, I am not trying to improve the performance of this code; I would just like to understand what is happening.

The general policy for the NVIDIA tools is to expose through the profilers only data that is actionable by the developer. As such, the tools do not provide sufficient information for you to determine why inst_replay_overhead is so high.

I do not have a GK110 in my system, but I am positive that your kernel is memory latency limited. If you were to increase the number of blocks, it would become memory throughput limited. The number of instructions issued is not the bottleneck for this kernel, so you should not be trying to optimize it; understanding why the value is high would only matter if you planned to increase the ALU complexity of the kernel.
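
For illustration only (this is not the poster's code and the sizes are made up; n is passed as a kernel parameter, as in my own test), a launch along these lines would give each SMX many waves of blocks and push the kernel toward the memory throughput limit:

#include <cuda_runtime.h>

__global__ void daxpy(double a, double *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        y[i] += a * x[i];
    }
}

int main(void)
{
    const int n = 2048 * 14 * 32;                       // hypothetical: 32x the original N
    double *dx, *dy;
    cudaMalloc((void **)&dx, n * sizeof(double));
    cudaMalloc((void **)&dy, n * sizeof(double));

    int blocksize = 1024;
    int gridsize  = (n + blocksize - 1) / blocksize;    // 448 blocks -> 16 waves of 2 blocks/SMX on 14 SMXs
    daxpy<<<gridsize, blocksize>>>(2.0, dx, dy, n);
    cudaDeviceSynchronize();

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}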

SASS for SM35 for the code provided. In my example N was passed as a parameter so the SASS may be slightly different.

inst_executed   inst_issued
1        MOV R1, c[0x0][0x44]                  1               1
2        S2R R0, SR_CTAID.X                    1               1
3        S2R R3, SR_TID.X                      1               1
4        IMAD R0, R0, c[0x0][0x28], R3         1               1
5        ISETP.GT.AND P0, PT, R0, 0x77ff, PT   1               1
6  @P0   BRA.U 0x90                            1               1
7  @!P0  MOV32I R3, 0x8                        1               1
8  @!P0  IMAD R8.CC, R0, R3, c[0x0][0x140]     1               1
9  @!P0  IMAD.HI.X R9, R0, R3, c[0x0][0x144]   1               1
10 @!P0  IMAD R6.CC, R0, R3, c[0x0][0x148]     1               1
11 @!P0  LD.E.CG.64 R4, [R8]                   1               2 transactions
                                                               2 l1 miss
12 @!P0  IMAD.HI.X R7, R0, R3, c[0x0][0x14c]   1               1
13 @!P0  LD.E.CG.64 R2, [R6]                   1               2 transactions
                                                               2 l1 miss
14 @!P0  DFMA R2, R4, c[0x0][0x150], R2        1               2 to issue a DMUL, DADD, DFMA
15 @!P0  ST.E.64 [R6], R2                      1               2 transactions
16       MOV RZ, RZ                            1               1
17       EXIT                                  1               1
         BRA 0xa0
TOTAL                                           17             >=25

For analysis I will list the instructions issued per warp. The code example currently launches 2 thread blocks of 32 warps (1024 threads) per SM so the kernel has only 1 wave of thread blocks.

Per warp the kernel has

  • 17 instructions executed
  • >=25 instructions issued (see column inst_issued above)

A 0.88 inst_replay_overhead implies ~15 additional instruction issues per warp.
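
Just to make that arithmetic explicit (the 17 comes from the SASS listing above, the 0.88 from nvprof; the formula is how the metric is defined):

#include <stdio.h>

int main(void)
{
    const double inst_executed        = 17.0;   // per warp, from the SASS listing
    const double inst_replay_overhead = 0.88;   // reported by nvprof

    // inst_replay_overhead = (inst_issued - inst_executed) / inst_executed
    double extra_issues = inst_replay_overhead * inst_executed;   // ~15 extra issues per warp
    double inst_issued  = inst_executed + extra_issues;           // ~32 issues per warp

    printf("extra issues per warp ~ %.0f, total issues per warp ~ %.0f\n",
           extra_issues, inst_issued);
    return 0;
}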

The first set of extra instruction issues is easy to explain and predict.

  • 1 due to the DFMA. On GK110, DADD, DFMA, and DMUL issue over 2 cycles.
  • 2 extra LD.E.64 issues due to address divergence.

The memory access pattern for the two LD.E.64 instructions is that each thread accesses the next consecutive double, so each warp-wide LD covers 256 bytes of data (two 128-byte transactions) and needs one additional instruction issue.

  • 4 extra LD.E.64 issues due to L1 cache miss.

On Kepler an uncached or a cached access can still have a cache miss (not intuitive). If the access is uncached or misses the cache, the request is put in a request table. When the data returns from L2, a message is sent to the scheduler to issue the instruction again and collect the data.

  • 1 extra ST.E.64 issue due to address divergence. See the LD.E.64 explanation.

The global_replay_overhead is calculated as 3/17 = 17.6%, which matches the 0.176471 reported by nvprof. This metric only accounts for the 3 address-divergence replays; the cache-miss replays are not counted.
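
The address-divergence part can be checked with a little arithmetic (the 128-byte transaction size is the Kepler L1 line size; the 17 is the per-warp inst_executed from above):

#include <stdio.h>

int main(void)
{
    const int threads_per_warp  = 32;
    const int bytes_per_double  = 8;
    const int transaction_bytes = 128;   // L1 transaction size on Kepler

    // A fully coalesced warp-wide 64-bit access touches 256 bytes,
    // which needs two 128-byte transactions, i.e. one extra issue.
    int transactions     = (threads_per_warp * bytes_per_double) / transaction_bytes;   // 2
    int extra_per_access = transactions - 1;                                            // 1

    // Two loads (x and y) plus one store (y) -> 3 divergence replays per warp.
    int divergence_replays = 3 * extra_per_access;

    printf("global_replay_overhead = %d / 17 = %.6f\n",
           divergence_replays, divergence_replays / 17.0);   // 0.176471
    return 0;
}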

The remaining issues are harder to understand, predict, or resolve, so I will not provide a clean breakdown of how many of each occur; I would need a GK110 to get the values.

The additional issues are due to

  • Instruction Cache Misses

The example currently launches 2 thread blocks of 32 warps per SM.
The example is very small and the initial instructions execute very fast.
Many of the warps will be affected by the initial i-cache miss. Some warps may hit a second i-cache miss.
If the kernel launched multiple waves of thread blocks this number would be reduced.
The first run of the kernel probably gets hit the worst, as the constants and/or instructions may not be in L2 yet, so the latency would be longer and affect more warps.

  • Constant Cache Misses

The first MOV and instructions 8, 9, 10, 12, and 14 read operands from the constant bank.
The first few warps to hit these instructions per SM will have to replay due to the constant cache miss.
As a programmer there is nothing you can do about these misses. If you have user-defined "constant" data then you want to declare module-scope __constant__ variables to help the compiler group the constants in the same cache lines (a small sketch follows below).
If you were to launch many waves this number would drop off, as the constant cache is primed after the first few warps.
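
A hypothetical sketch of what I mean by module-scope __constant__ variables (the names and sizes are made up; this is not the daxpy kernel):

// Related values declared together tend to land in the same constant cache lines.
__constant__ double coef[4];

__global__ void apply_coef(double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // Reads of coef[] go through the constant cache; after the first few
        // warps prime it, later warps no longer replay on these accesses.
        y[i] = coef[0] + coef[1] * y[i];
    }
}

// Host side, filled once before any launches:
//   double h_coef[4] = { ... };
//   cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef));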

  • LSU Buffers Full

The L1 cache can queue up N writes and M pending loads. Once these are full, additional issues to L1 will need to be replayed.
In this example you are issuing 4 load transactions and 2 store transactions per warp, for per-SM totals of 256 and 128 respectively (worked out below). There are very few ALU instructions per warp, so it is very likely this program is exceeding those buffers.
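
Spelling out those per-SM totals (2 resident blocks of 32 warps per SM, and the per-warp transaction counts from the SASS listing above):

#include <stdio.h>

int main(void)
{
    const int warps_per_sm       = 2 * 32;   // 2 resident blocks of 1024 threads
    const int load_transactions  = 4;        // per warp: 2 per LD.E.64, two loads
    const int store_transactions = 2;        // per warp: 2 per ST.E.64, one store

    printf("per-SM load transactions:  %d\n", warps_per_sm * load_transactions);   // 256
    printf("per-SM store transactions: %d\n", warps_per_sm * store_transactions);  // 128
    return 0;
}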

We need more forum posts like this! Actual technical responses with details are a godsend, and info such as "a cache miss results in >=2 issues due to reissuing" is very important to know.

Hi Greg,

Thank you very much for your clear explanation.
My original concern was that actual performance was worse than what my model predicted, by a factor similar to the instruction replay rate, and I wanted to know if that was where the discrepancy came from.
Your post not only explains why I had this discrepancy, but also why performance increased more than linearly when feeding more blocks per SM.