What is the difference in how L2 is used by cudaMemcpy() vs. from a kernel?

While using a kernel to test the latency of L2, I found a weird phenomenon.
I create an array of ARRAY_SIZE float elements and measure the per-element access time in two ways (a minimal sketch follows below):

  1. Initialize the array outside the kernel and use cudaMemcpy() to move the data to DRAM.
  2. Initialize the array inside the kernel using __stcg().

When ARRAY_SIZE is over 1 MB (e.g. 2 MB), the latency with way 1 is higher than with way 2:
    way 1: in every group of 8 elements, 7 elements with low latency (about 200 cycles) are followed by 1 element with high latency (>400 cycles) - this looks like an L2 miss
    way 2: accessing every element takes about 200 cycles - this looks like an L2 hit
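A minimal sketch of the two setups (the kernel names, the single-thread clock64() timing loop, and the launch configuration are only illustrative, not my exact code):

```
#include <cstdio>
#include <cuda_runtime.h>

// 2 MB worth of floats, i.e. larger than the 1 MB threshold described above.
#define ARRAY_SIZE (2u * 1024u * 1024u / sizeof(float))

// Way 2: initialize the array on the device; __stcg() stores with the
// "cache global" hint, so the written lines go through L2.
__global__ void init_with_stcg(float *data, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        __stcg(&data[i], (float)i);
}

// Rough per-access timing: a single thread reads every element with
// __ldcg() (load through L2, bypassing L1) and records clock64() deltas.
// The deltas include timing overhead and the stores to `cycles` also touch
// L2, so this is only an approximate measurement.
__global__ void measure_latency(const float *data, unsigned n,
                                long long *cycles)
{
    float sink = 0.0f;
    for (unsigned i = 0; i < n; ++i) {
        long long t0 = clock64();
        sink += __ldcg(&data[i]);
        long long t1 = clock64();
        cycles[i] = t1 - t0;
    }
    // Keep the loads from being optimized away.
    if (sink == -1.0f)
        printf("%f\n", sink);
}

int main()
{
    const unsigned n = (unsigned)ARRAY_SIZE;
    float *d_data;
    long long *d_cycles;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_cycles, n * sizeof(long long));

    // Way 1: initialize on the host and copy to device DRAM with cudaMemcpy().
    float *h_data = new float[n];
    for (unsigned i = 0; i < n; ++i)
        h_data[i] = (float)i;
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Way 2: comment out the cudaMemcpy() above and use this instead.
    // init_with_stcg<<<(n + 255) / 256, 256>>>(d_data, n);

    measure_latency<<<1, 1>>>(d_data, n, d_cycles);
    cudaDeviceSynchronize();

    long long *h_cycles = new long long[n];
    cudaMemcpy(h_cycles, d_cycles, n * sizeof(long long),
               cudaMemcpyDeviceToHost);
    // Print the first 32 accesses; with way 1 I see the 1-in-8 high-latency
    // pattern here, with way 2 they are all roughly the same.
    for (unsigned i = 0; i < 32; ++i)
        printf("element %u: %lld cycles\n", i, h_cycles[i]);

    delete[] h_data;
    delete[] h_cycles;
    cudaFree(d_data);
    cudaFree(d_cycles);
    return 0;
}
```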

My question is: is there a limit on the amount of L2 cache used by cudaMemcpy(), such as 1 MB (as suggested in the thread "cudaMemcpy() and L2 cache" in the CUDA Programming and Performance category of the NVIDIA Developer Forums)?

Hi there @qlyu and welcome to the NVIDIA developer forums!

The way cudaMemcpy() behaves in terms of cache usage may be proprietary to our driver and not something we can disclose. But you might want to ask the same question in the forum category where you found the original post, CUDA Programming and Performance, where you will find the subject matter experts.

Another option is to check out our CUDA developer tools, including CUPTI, which can be used to profile your CUDA app.

I hope that helps.