As I have been interested in the cache access times too, I wanted to check your measurements.
So here are my results measured on a GeForce GTX 560 Ti and Quadro 6000 for which I got the same results (in cycles cleaned from overhead):
1060 non-cached
248 L2
18 L1
I also tried to access volatile data, which seems to use the L2 cache.
I furthermore measured the time for using atomicAdd, which also seems to use the L2 cache.
For this test I increased the number of atomic operations on the same dataword using a modulo operation on the thread index:
1390 atomic non-cached
580 atomic L2
1380 atomic non-cached 2x conflict
570 atomic L2 2x conflict
1490 atomic non-cached 4x conflict
600 atomic L2 4x conflict
1900 atomic non-cached 8x conflict
1145 atomic L2 8x conflict
2980 atomic non-cached 16x conflict
2198 atomic L2 16x conflict
4680 atomic non-cached 32x conflict
3910 atomic L2 32x conflict
Here is the code if you want to compare it:
__global__ void timerFunction(float * global, clock_t * timings)
{
__shared__ volatile float shared[32];
shared[threadIdx.x] = 0;
__syncthreads();
clock_t t[2*RUNS];
for(uint i = 0; i < RUNS; ++i)
{
__syncthreads();
t[2*i] = clock();
shared[threadIdx.x] = global[threadIdx.x];
//shared[threadIdx.x] = threadIdx.x; //computing the overhead
//shared[threadIdx.x] = atomicAdd(global+(threadIdx.x%CONFLICTMODULO), 1.0f); // atomic test
t[2*i+1] = clock();
__syncthreads();
}
__syncthreads();
for(uint i = 0; i < 2*RUNS; ++i)
timings[threadIdx.x*2*RUNS + i] = t[i];
}
greets