While using a kernel to measure L2 latency, I found a strange phenomenon.
I create an array of ARRAY_SIZE float elements and measure the time to access each element, with the array set up in 2 ways:
- initialize the array outside the kernel and use cudaMemcpy to copy the data to DRAM
- initialize the array inside a kernel using __stcg.
But when the array is larger than 1 MB (e.g. 2 MB), the latency with way 1 is higher than with way 2.
way 1: in every group of 8 elements, 1 element has high latency (>400 cycles) and the other 7 have low latency (about 200 cycles) - the slow access seems to miss in L2
way 2: every access takes about 200 cycles - seems to hit in L2
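For reference, here is a minimal sketch of the kind of test described above. It is not the exact code: the kernel names `init_stcg` and `time_loads`, the sample count `kSamples`, the launch configuration, and the use of `__ldcg` with `clock64()` for timing are all assumptions made for illustration, and the compiler may reorder the timing reads, so the SASS should be checked before trusting the numbers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed size: 2 MB of floats, matching the "e.g. 2 MB" case in the text.
constexpr size_t ARRAY_SIZE = (2u * 1024u * 1024u) / sizeof(float);
// Hypothetical number of elements to time; kept small for readability.
constexpr size_t kSamples = 64;

// Way 2: initialize the array on the device with cache-global stores.
__global__ void init_stcg(float *a, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        __stcg(&a[i], 1.0f);
}

// Single-thread timing of each load. __ldcg caches in L2 only (bypasses L1).
__global__ void time_loads(const float *a, size_t n, unsigned *cycles, float *sink) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        long long t0 = clock64();
        acc += __ldcg(&a[i]);   // data dependency intended to hold the load inside the timed region
        long long t1 = clock64();
        cycles[i] = (unsigned)(t1 - t0);
    }
    *sink = acc;                // keep the loads from being optimized away
}

// Run the timing kernel and print the first few per-element latencies.
static void measure(const char *label, const float *d_a, unsigned *d_cycles, float *d_sink) {
    std::vector<unsigned> h_cycles(kSamples);
    time_loads<<<1, 1>>>(d_a, kSamples, d_cycles, d_sink);
    cudaMemcpy(h_cycles.data(), d_cycles, kSamples * sizeof(unsigned),
               cudaMemcpyDeviceToHost);   // blocking copy, also waits for the kernel
    printf("%s:", label);
    for (size_t i = 0; i < 16; ++i) printf(" %u", h_cycles[i]);
    printf("\n");
}

int main() {
    const size_t bytes = ARRAY_SIZE * sizeof(float);
    float *d_a = nullptr, *d_sink = nullptr;
    unsigned *d_cycles = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_sink, sizeof(float));
    cudaMalloc(&d_cycles, kSamples * sizeof(unsigned));

    // Way 1: initialize on the host, then copy to device memory.
    std::vector<float> h_a(ARRAY_SIZE, 1.0f);
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    measure("way 1 (cudaMemcpy)", d_a, d_cycles, d_sink);

    // Way 2: initialize inside a kernel with __stcg.
    init_stcg<<<64, 256>>>(d_a, ARRAY_SIZE);
    cudaDeviceSynchronize();
    measure("way 2 (__stcg)", d_a, d_cycles, d_sink);

    cudaFree(d_a);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

In this sketch 8 consecutive floats span 32 bytes, so a "1 slow access per 8 elements" pattern would line up with a 32-byte fetch granularity.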
My question is: is there a limit on how much of the L2 cache cudaMemcpy() can use, such as 1 MB (as suggested in the thread "cudaMemcpy() and L2 cache" in CUDA / CUDA Programming and Performance on the NVIDIA Developer Forums)?