cudaMemcpy() and L2 cache.

Mungio · May 17, 2016, 4:04pm

Hi. I have a doubt on cudaMemcpy() fucntion. When I call this function to copy data from Host to Device, this data fill only the Global Memory or also the L2 cache?

Robert_Crovella · May 17, 2016, 9:01pm

ALL accesses to DRAM take place through the L2 cache. And the L2 cache cannot be disabled.

Mungio · May 18, 2016, 8:40am

So also the Host when access to DRAM, access throught L2 cache?

Thanks a lot.

Robert_Crovella · May 18, 2016, 12:40pm

Yes, for GPUs connected to a system by PCI Express. The Jetson TK1 case may be different, I haven’t looked at that carefully.

For a discrete GPU system, you can demonstrate the truth of this fairly easily with a profiler. Write a small application that only loads data within a small working set - something that will fit in the L2 cache, so choose a data set size less than a megabyte, perhaps (will depend on your GPU. I’m not sure I know how to demonstrate this on a TK1).

Do the normal sequence of load data to memory with cudaMemcpy, then call your kernel, etc.

If you profile it, you will find the number of dram load transactions is approximately zero. This is a remarkable result for a kernel that is loading data “for the first time” from global memory, and it is explained by the fact that the previous cudaMemcpy operation “populated” the L2 cache.

Mungio · May 18, 2016, 4:41pm

Thx a lot txbob

Mungio · May 26, 2016, 1:36pm

Hi, I do others experiments on this topic and I found something strange:

I do this on my NVIDIA 960M with 2 MB of L2 cache, using this code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"
#include <stdio.h>
#include <stdlib.h>

#define MEGABYTE 1048576

__global__ void kernel(const int *in, int *out)
{   
	out[0 + threadIdx.x] = in[0 + threadIdx.x];
}

int main(int argc,char* argv[])
{
	
	int *dev_in = 0;
	int *dev_out = 0;
	
	int * in = (int*)malloc(sizeof(int) * MEGABYTE);
	int * out = (int*)malloc(sizeof(int) * MEGABYTE);

	for (int i = 0; i < MEGABYTE; i++)
		in[i] = i;

	cudaMalloc((void**)&dev_in, sizeof(int) * MEGABYTE);
	cudaMalloc((void**)&dev_out, sizeof(int) * MEGABYTE);

	cudaProfilerStart();

	cudaMemcpy(dev_in, in, sizeof(int) * MEGABYTE, cudaMemcpyHostToDevice);
	
	kernel<<<1, 32>>>(dev_in, dev_out);

	cudaMemcpy(out, dev_out, sizeof(int) * MEGABYTE, cudaMemcpyDeviceToHost);

	cudaDeviceSynchronize();
	cudaProfilerStop();
    
    
	free(in);
	free(out);
	cudaFree(dev_in);
	cudaFree(dev_out);
	cudaDeviceReset();

    return 0;
}

In this code I load 4 MB on global Memory and in the kernel I read the first 128 Byte. The L2 Hit Rate is 0% as shown in these figures: https://www.dropbox.com/s/o7hgn6y21nbg8jn/Overview_Misses.PNG?dl=0
https://www.dropbox.com/s/mquivv9expyy9ul/Caches_Miss_1.PNG?dl=0
https://www.dropbox.com/s/gt32efmpbye0mzv/Buffers_Missis.PNG?dl=0

After I edit the code and I read the last 128 Byte

out[(MEGABYTE - 32) + threadIdx.x] = in[(MEGABYTE - 32) + threadIdx.x];

Now the L2 Hit Rate is 100% as shown in these figures:
https://www.dropbox.com/s/ayot3nllw4pubum/Overview_No_Misses.PNG?dl=0
https://www.dropbox.com/s/6i1kk6nkebi4v8y/Caches_No_Miss_1.PNG?dl=0
https://www.dropbox.com/s/ietyewcei63cbyr/Buffers_No_Missis.PNG?dl=0

This experiment seems to confirm what has been said above. I do an other experiment where i deleted the two cudaMemcopy() and I repeted the two cases. How is possible that I do not execute any cudaMemcpy() but I have a 100% L2 Hit Rate? And what are “Buffers”? There are some Load from device and systems that I cannot explain. Figures below shown these results:

https://www.dropbox.com/s/9om1djpdjk0z6cj/Overview_Nocpy_NoMisses.PNG?dl=0
https://www.dropbox.com/s/uffhab7wi582kuy/Buffers_Nocpy_NoMisses.PNG?dl=0

Thanks for any answers.

Robert_Crovella · May 26, 2016, 10:02pm

I’m not entirely sure of the circumstances under which the L2 cache lines are invalidated. You may be assuming that they are invalidated at application start, but I’m not sure that is the case. If you think carefully about GPU concurrency scenarios, you may agree with me.

I see no reason why the hardware would have to invalidate the L2 cache lines if the underlying data has not been touched.

Therefore I think it’s possible that your results with no cudaMemcpy may be possible.

To maximize the likelihood of L2 cache invalid state, I would

Run an app that allocates “all” of available GPU memory, then writes to all of it, sequentially. This will ensure that only the last few megabytes are “in” the L2 cache.
Then run your app. The allocations there are presumably not likely to end up at the “end” of the previously allocated region from step 1, therefore L2 cache should be invalid for that region.

Mungio · May 30, 2016, 2:15pm

I do this experiment: I shutdown my PC and I launch the application without memcpy() function. I try to access to the last 2 MB of global memory and I have 100% Hit Rate of L2 cache!

It’s too strange for me.

Maybe also cudaMalloc() changes something in cache.

qlyu · May 5, 2023, 1:46am

why the data set size is limited in 1 megabyte?

When I using kernel to test latency of L2, I find a weird phenomenon.
I create an array with ARRAY_SIZE elements (float) and test the time to access 1 element in the array using 2 ways:

initiate the array outside kernel and using cudaMemcpy() to move the data to Dram
initiate the array in kernel using __stcg.

But when I ARRAY_SIZE is over 1 Megabyte (e.g. 2MB), the latency is different:
way 1: for every 8 elements, 1 element with high latency (>400 cycles) will follow 7 elements with low latency ( about 200 cycles) - seems miss in L2
way 2: accessing every element takes about 200 cycles - seems hit in L2

My question is， is there any limit on L2 cache size used by cudaMemcpy()?

Robert_Crovella · May 6, 2023, 3:31pm

I’m not aware of any limit.

Topic		Replies	Views
FAO: Nvidia Engineers:- Memory Leak in cudaMemcpyAsync Only occurs on Host To Device memory transfer CUDA Programming and Performance	4	5870	August 18, 2010
Understanding Caching/Flushing Behavior/Performance in computeprof for Kepler CUDA Programming and Performance	6	3321	September 19, 2014
cudaMemcpy2D() and a few gray hairs It's very slow CUDA Programming and Performance	8	4533	February 13, 2009
What the difference using L2 by cudaMemcpy() and from kernel? General Topics and Other SDKs	1	457	May 4, 2023
Got out of memory from cudaMemcpy CUDA Programming and Performance	13	3911	January 28, 2022
Difference between L2 read/write transactions and L2_L1 read/write transactions ? CUDA Programming and Performance	3	1346	August 28, 2019
Understanding the functioning of nvprof and .cv data load option CUDA Programming and Performance	8	3047	December 11, 2014
Tesla K40 L2 bandwidth CUDA Programming and Performance	12	4027	December 23, 2015
2MB size limit cudaMemcpy CUDA Programming and Performance	18	7290	April 18, 2011
Is cudaMemcpy() real-time safe? CUDA Programming and Performance cuda	11	497	March 30, 2024

cudaMemcpy() and L2 cache.

Related topics