Very large instruction replay overhead for random memory access on Kepler

Hello, I am studying the performance of random memory access on a Kepler GPU (K40m). The kernel I use is quite simple:

__global__ void scatter(int *in1, int *out1, int *loc, const size_t n)
{
	// grid-stride loop: each thread handles elements globalId, globalId + globalSize, ...
	int globalSize = gridDim.x * blockDim.x;
	int globalId = blockDim.x * blockIdx.x + threadIdx.x;

	for (unsigned int i = globalId; i < n; i += globalSize) {
	    int pos = loc[i];       // destination index for element i
	    out1[pos] = in1[i];     // scattered (random) store
	}
}

That is, the kernel reads an input array in1 and a location array loc, and permutes in1 into out1 according to loc; in other words, out1[loc[i]] = in1[i]. Note that the location array is thoroughly shuffled and every element is unique, i.e. it is a random permutation of the indices.
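In case it helps to make the setup concrete, this is roughly how I build the location array on the host (a sketch only; the function name, container choice, and seed are just placeholders and not part of the question):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Sketch: build loc as a random permutation of 0..n-1 so that every
// destination index is unique and the stores in the kernel are fully scattered.
std::vector<int> make_loc(size_t n)
{
    std::vector<int> loc(n);
    std::iota(loc.begin(), loc.end(), 0);   // loc[i] = i
    std::mt19937_64 rng(12345);             // arbitrary seed
    std::shuffle(loc.begin(), loc.end(), rng);
    return loc;
}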

I use the default nvcc compilation settings with the -O3 flag enabled, and the L1 data cache is disabled. I also fix the launch configuration at 8192 blocks with a block size of 1024 threads.
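For reference, the host side looks roughly like this (a sketch only; error checking and the initialization of the device arrays are omitted, and the variable names are my own):

#include <cuda_runtime.h>

int main()
{
    const size_t n = 600UL * 1024 * 1024;           // e.g. 600M tuples
    const int numBlocks = 8192, blockSize = 1024;   // fixed launch configuration

    int *d_in1, *d_out1, *d_loc;
    cudaMalloc(&d_in1,  n * sizeof(int));
    cudaMalloc(&d_out1, n * sizeof(int));
    cudaMalloc(&d_loc,  n * sizeof(int));
    // ... fill d_in1 with data and d_loc with a shuffled permutation of 0..n-1 ...

    scatter<<<numBlocks, blockSize>>>(d_in1, d_out1, d_loc, n);
    cudaDeviceSynchronize();

    cudaFree(d_in1); cudaFree(d_out1); cudaFree(d_loc);
    return 0;
}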

I use nvprof to profile my program. It is easy to see that most of the instructions in the kernel are memory accesses. For a warp-level instruction, since each thread demands a separate 4-byte word, the instruction has to be replayed multiple times (at most 31 times?) and issue multiple memory transactions to serve all the threads within the warp. However, the metric "inst_replay_overhead" is confusing: when the number of tuples n = 16M, the replay overhead is 13.97, which makes sense to me. But when n = 600M, the replay overhead becomes 34.68. For even larger data, say 700M and 800M, the replay overhead reaches 85.38 and 126.87.
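To be explicit about what I think the metric measures (the counter names and the formula here are my assumption of how the profiler derives it, not something I have verified):

// Hypothetical helper showing how I understand inst_replay_overhead is derived
// from the issued vs. executed instruction counters reported by the profiler.
double replay_overhead(double inst_issued, double inst_executed)
{
    // every issue beyond the one "executed" issue of a warp instruction is a replay
    return (inst_issued - inst_executed) / inst_executed;
}

// With full address divergence, a single warp load/store touching 32 distinct
// cache lines should need at most 32 issues, i.e. replay_overhead(32, 1) == 31,
// which is where my "at most 31" expectation comes from.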

According to the documentation, "inst_replay_overhead" is the "average number of replays for each instruction executed". Does that mean that when n = 800M, each executed instruction has on average been replayed 127 times? How can the replay count be much larger than 31 here? Am I misunderstanding something, or am I missing other factors that also contribute significantly to the replays? Thanks a lot!