What is the difference between L2 cache use by cudaMemcpy() and by a kernel?

qlyu · May 4, 2023, 3:35am

While using a kernel to test L2 latency, I found a strange phenomenon.
I create an array of ARRAY_SIZE elements (float) and measure the time to access one element of the array, with the array set up in one of two ways:

1. Initialize the array outside the kernel and use cudaMemcpy() to move the data to DRAM.
2. Initialize the array inside a kernel using __stcg.

But when ARRAY_SIZE is over 1 megabyte (e.g. 2 MB), the latency with way 1 is higher than with way 2:
way 1: out of every 8 elements, 1 element with high latency (>400 cycles) follows 7 elements with low latency (about 200 cycles), which looks like a miss in L2
way 2: accessing every element takes about 200 cycles, which looks like a hit in L2
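
For readers who want to try this, here is a minimal sketch of the two setups described above. It is not the original test code: the 2 MB array size, the SAMPLES window of 64 timed elements starting at index 0, the fill value, and the helper names (init_stcg, time_loads, report) are all assumptions made for illustration. Loads use __ldcg so that they are served from L2 rather than L1, and each loaded value is stored to shared memory before the second clock read so that the timed interval depends on the load. The compiler is still free to reorder instructions, so treat the numbers as indicative only. Note that 8 floats span 32 bytes, which is the L2 sector size on recent NVIDIA GPUs, and that would be consistent with a 1-in-8 miss pattern.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

constexpr size_t ARRAY_SIZE = 512 * 1024;  // floats, 2 MB in total (assumed)
constexpr int SAMPLES = 64;                // number of timed elements (assumed)

// Way 2: initialize the array from a kernel with __stcg (store cached in L2).
__global__ void init_stcg(float* a, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        __stcg(&a[i], 1.0f);
}

// One thread times SAMPLES consecutive loads that bypass L1 (__ldcg).
__global__ void time_loads(const float* a, unsigned* cycles, float* sink) {
    __shared__ unsigned s_cycles[SAMPLES];
    __shared__ float s_val[SAMPLES];
    for (int i = 0; i < SAMPLES; ++i) {
        long long t0 = clock64();
        s_val[i] = __ldcg(&a[i]);  // the shared-memory store waits for the load
        long long t1 = clock64();
        s_cycles[i] = (unsigned)(t1 - t0);
    }
    float acc = 0.0f;
    for (int i = 0; i < SAMPLES; ++i) {
        cycles[i] = s_cycles[i];
        acc += s_val[i];
    }
    *sink = acc;  // keep the loaded values live
}

static void report(const char* label, const float* d_a, unsigned* d_cycles, float* d_sink) {
    unsigned h_cycles[SAMPLES];
    time_loads<<<1, 1>>>(d_a, d_cycles, d_sink);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost));
    std::printf("%s:", label);
    for (int i = 0; i < SAMPLES; ++i) std::printf(" %u", h_cycles[i]);
    std::printf("\n");
}

int main() {
    float *d_a = nullptr, *d_sink = nullptr;
    unsigned* d_cycles = nullptr;
    CHECK(cudaMalloc(&d_a, ARRAY_SIZE * sizeof(float)));
    CHECK(cudaMalloc(&d_cycles, SAMPLES * sizeof(unsigned)));
    CHECK(cudaMalloc(&d_sink, sizeof(float)));

    // Way 1: initialize on the host and copy to the device with cudaMemcpy().
    std::vector<float> h_a(ARRAY_SIZE, 1.0f);
    CHECK(cudaMemcpy(d_a, h_a.data(), ARRAY_SIZE * sizeof(float), cudaMemcpyHostToDevice));
    report("way 1 (cudaMemcpy)", d_a, d_cycles, d_sink);

    // Way 2: overwrite the same array from a kernel using __stcg.
    init_stcg<<<64, 256>>>(d_a, ARRAY_SIZE);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    report("way 2 (__stcg)", d_a, d_cycles, d_sink);

    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_cycles));
    CHECK(cudaFree(d_sink));
    return 0;
}
```

Built with nvcc, this prints one line of cycle counts per setup. Changing ARRAY_SIZE is the obvious experiment for checking whether the behaviour changes around the 1 MB mark.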

My question is: is there any limit on the L2 cache size used by cudaMemcpy(), such as 1 MB (as suggested in "cudaMemcpy() and L2 cache." - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums)?
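
To put the 1 MB figure in context, it may help to compare the array footprint with the total L2 capacity that the device reports. The sketch below does only that. It says nothing about how cudaMemcpy() itself uses L2. The helper fits_in_l2 and the 2 MB footprint are assumptions for illustration. cudaGetDeviceProperties and the l2CacheSize field are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: does a buffer of `bytes` fit in an L2 of `l2_bytes`?
static bool fits_in_l2(size_t bytes, int l2_bytes) {
    return l2_bytes > 0 && bytes <= (size_t)l2_bytes;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    const size_t array_bytes = 512 * 1024 * sizeof(float);  // 2 MB (assumed)
    std::printf("%s: L2 cache size = %d bytes\n", prop.name, prop.l2CacheSize);
    std::printf("array of %zu bytes %s in L2\n", array_bytes,
                fits_in_l2(array_bytes, prop.l2CacheSize) ? "fits" : "does not fit");
    return 0;
}
```

If the array fits comfortably in the reported L2 and way 1 still shows misses, that would suggest the difference comes from how the data gets into memory rather than from total capacity.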

MarkusHoHo · May 4, 2023, 10:51am

Hi there @qlyu and welcome to the NVIDIA developer forums!

How cudaMemcpy() uses the cache may be proprietary to our driver and not something to be disclosed. But you might want to ask the same question in the forum category where you found the original post, CUDA Programming and Performance; there you will find the subject matter experts.

Another option is to check out our CUDA developer tools, including CUPTI, which can be used to profile your CUDA app.

I hope that helps.
