Strided local/global/generic memory accesses on Kepler

Hi everyone,

I’m playing around with cache microbenchmarking on my GK104. I set out to determine the latencies of the individual memory hierarchy levels using the standard approach of a single thread chasing a pointer through memory for various stride lengths and memory sizes. This works well for L2 and global memory, but it is difficult for the L1, since on Kepler the L1 is used only for thread-local memory accesses, not for global loads.
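
The chase pattern itself is precomputed and uploaded beforehand; the layout is the usual index chain, roughly like this (a simplified sketch, names are illustrative; strideElems is the stride in bytes divided by sizeof(int)):

void buildChasePattern(int* hostArray, unsigned elems, unsigned strideElems) {
	// Element i points to element (i + strideElems) mod elems, so a single
	// chain of dependent loads strides through the buffer and wraps around.
	for (unsigned i = 0; i < elems; i++) {
		hostArray[i] = (int)((i + strideElems) % elems);
	}
}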

Thus, I crafted a kernel in which a thread allocates a large local array, copies the precomputed stride data in from global memory, and then chases the pointer through the local array, like so:

__global__
void cudaGenericMemStride(int* array, unsigned elems, long long int* time) {
	int k = 0;
	const unsigned repeats = LOOP_REPEATS;

	// Copy the precomputed chase pattern from global memory into the
	// thread-local array so that the chase itself runs on local memory.
	int localArray[LOCAL_MEM_ELEMS];
	for (unsigned i = 0; i < elems && i < LOCAL_MEM_ELEMS; i++) {
		localArray[i] = array[i];
	}

	long long int start = clock64();

	// Chain of dependent loads: each address depends on the previous
	// result, so the loads cannot overlap and the measured time is the
	// sum of the individual access latencies.
	for (unsigned i = 0; i < repeats; i++) {
		repeat(UNROLL_REPEATS, k = localArray[k]; )
	}

	long long int end = clock64();

	// The branch on k keeps the compiler from removing the chase as dead code.
	if (k < 0) {
		*time = -1;
	} else {
		*time = start < end ? end - start : end + (0xffffffffffffffff - start);
	}
}
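
For anyone wondering, repeat is just a preprocessor unrolling helper that pastes the statement UNROLL_REPEATS times back to back; the idea is something like this (simplified sketch, not necessarily the exact definition used in the benchmark):

// Illustrative sketch of the unrolling helper; a real version covers more counts.
#define CONCAT_(a, b)  a##b
#define CONCAT(a, b)   CONCAT_(a, b)
#define REPEAT_2(S)    S S
#define REPEAT_4(S)    REPEAT_2(S)  REPEAT_2(S)
#define REPEAT_8(S)    REPEAT_4(S)  REPEAT_4(S)
#define REPEAT_16(S)   REPEAT_8(S)  REPEAT_8(S)
#define REPEAT_32(S)   REPEAT_16(S) REPEAT_16(S)
// repeat(32, k = localArray[k]; ) expands to 32 consecutive copies of the statement.
#define repeat(N, S)   CONCAT(REPEAT_, N)(S)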

The LOCAL_MEM_ELEMS macro is defined as the largest element count used in any test, so the data copied in from global memory always fits into the local array. I checked the SASS generated with cuobjdump --dump-sass, and I see a long sequence of dependent LDL instructions, just as expected.
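
In case someone wants to reproduce that check, it is simply (the executable name here is just a placeholder):

cuobjdump --dump-sass stride_bench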

Now here comes the problem: while the kernel runs fine, the timing output is complete garbage. The reported time is essentially independent of the stride size, is way lower than I would expect, and shows almost no variance at all. Here is part of the output as an example:

array size,stride in bytes,grid,block,cumulative latency,number of accesses
50176,8,1,1,4.72467e+10,32768
50176,16,1,1,4.72467e+10,32768
50176,32,1,1,4.72467e+10,32768
50176,64,1,1,4.72467e+10,32768
50176,128,1,1,4.72467e+10,32768
50176,256,1,1,4.72467e+10,32768
50176,512,1,1,4.72467e+10,32768
50176,1024,1,1,4.72467e+10,32768
50176,2048,1,1,4.72467e+10,32768
50176,4096,1,1,4.72467e+10,32768
50176,8192,1,1,4.72467e+10,32768
50176,16384,1,1,4.72467e+10,32768
50176,32768,1,1,4.72467e+10,32768
50176,65536,1,1,4.72467e+10,32768
50176,131072,1,1,4.72467e+10,32768
50176,262144,1,1,4.72467e+10,32768
50176,524288,1,1,4.72467e+10,32768
50176,1048576,1,1,4.72467e+10,32768
50176,2097152,1,1,4.72467e+10,32768
50176,4194304,1,1,4.72467e+10,32768
50176,8388608,1,1,4.72467e+10,32768
50176,16777216,1,1,4.72467e+10,32768
50176,33554432,1,1,4.72467e+10,32768
50176,67108864,1,1,4.72467e+10,32768
50176,134217728,1,1,4.72467e+10,32768
50176,268435456,1,1,4.72467e+10,32768
235520,8,1,1,4.72467e+10,32768
235520,16,1,1,4.72467e+10,32768
235520,32,1,1,4.72467e+10,32768
235520,64,1,1,4.72467e+10,32768
235520,128,1,1,4.72467e+10,32768
235520,256,1,1,4.72467e+10,32768
235520,512,1,1,4.72467e+10,32768
235520,1024,1,1,4.72467e+10,32768
235520,2048,1,1,4.72467e+10,32768
235520,4096,1,1,4.72467e+10,32768
235520,8192,1,1,4.72467e+10,32768
235520,16384,1,1,4.72467e+10,32768
235520,32768,1,1,4.72467e+10,32768
235520,65536,1,1,4.72467e+10,32768
235520,131072,1,1,4.72467e+10,32768
235520,262144,1,1,4.72467e+10,32768
235520,524288,1,1,4.72467e+10,32768
235520,1048576,1,1,4.72467e+10,32768
235520,2097152,1,1,4.72467e+10,32768
235520,4194304,1,1,4.72467e+10,32768
235520,8388608,1,1,4.72467e+10,32768
235520,16777216,1,1,4.72467e+10,32768
235520,33554432,1,1,4.72467e+10,32768
235520,67108864,1,1,4.72467e+10,32768
235520,134217728,1,1,4.72467e+10,32768
235520,268435456,1,1,4.72467e+10,32768

I have a similar kernel for global memory testing that works perfectly. Does anyone have an idea what is going awry with the local memory approach? Any help is greatly appreciated.
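
For reference, the global memory version is essentially the same dependent-load loop, just chasing directly through the global array instead of a local copy; a sketch with details trimmed:

__global__
void cudaGlobalMemStride(int* array, long long int* time) {
	int k = 0;
	const unsigned repeats = LOOP_REPEATS;

	long long int start = clock64();

	// Identical chase, but the loads go to global memory (L2/DRAM on Kepler).
	for (unsigned i = 0; i < repeats; i++) {
		repeat(UNROLL_REPEATS, k = array[k]; )
	}

	long long int end = clock64();

	// Same dead-code guard as in the local memory kernel.
	if (k < 0) {
		*time = -1;
	} else {
		*time = end - start;
	}
}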

Thanks,
Michael