cudaMemcpy() and L2 cache.

Hi. I have a question about the cudaMemcpy() function. When I call it to copy data from host to device, does the data fill only global memory, or also the L2 cache?

ALL accesses to DRAM take place through the L2 cache. And the L2 cache cannot be disabled.

So when the host accesses DRAM, it also goes through the L2 cache?

Thanks a lot.

Yes, for GPUs connected to a system by PCI Express. The Jetson TK1 case may be different, I haven’t looked at that carefully.

For a discrete GPU system, you can demonstrate the truth of this fairly easily with a profiler. Write a small application that only loads data within a small working set - something that will fit in the L2 cache, so choose a data set size of less than a megabyte, perhaps (the exact limit will depend on your GPU; I’m not sure I know how to demonstrate this on a TK1).

Do the normal sequence: load the data to device memory with cudaMemcpy, then call your kernel, etc.

If you profile it, you will find the number of dram load transactions is approximately zero. This is a remarkable result for a kernel that is loading data “for the first time” from global memory, and it is explained by the fact that the previous cudaMemcpy operation “populated” the L2 cache.
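
A minimal version of such a test might look like the sketch below (the buffer size, kernel, and profiler metric name are illustrative assumptions, not a prescribed recipe). The working set is kept well under the L2 size, and the profiler is then used to check the kernel’s DRAM read transactions:

#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>

#define N (128 * 1024)   // 128K ints = 512 KB, small enough to fit in a typical L2 cache

__global__ void read_kernel(const int *in, int *out)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < N)
		out[i] = in[i];
}

int main()
{
	int *h_in = (int *)malloc(N * sizeof(int));
	for (int i = 0; i < N; i++)
		h_in[i] = i;

	int *d_in = 0, *d_out = 0;
	cudaMalloc((void **)&d_in, N * sizeof(int));
	cudaMalloc((void **)&d_out, N * sizeof(int));

	// The host-to-device copy goes through the L2 cache, so it should leave the data resident there.
	cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

	// The kernel reads the data "for the first time" - but from L2, not DRAM.
	read_kernel<<<(N + 255) / 256, 256>>>(d_in, d_out);
	cudaDeviceSynchronize();

	// Profile e.g. with:  nvprof --metrics dram_read_transactions ./a.out
	// (metric name may vary with GPU and profiler version). If the copy populated
	// the L2, the kernel's DRAM read transaction count should be approximately zero.

	cudaFree(d_in);
	cudaFree(d_out);
	free(h_in);
	return 0;
}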

Thx a lot txbob

Hi, I did some other experiments on this topic and I found something strange.

I did this on my NVIDIA 960M, which has 2 MB of L2 cache, using this code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"
#include <stdio.h>
#include <stdlib.h>

#define MEGABYTE 1048576

__global__ void kernel(const int *in, int *out)
{   
	out[0 + threadIdx.x] = in[0 + threadIdx.x];
}

int main(int argc,char* argv[])
{
	
	int *dev_in = 0;
	int *dev_out = 0;
	
	int * in = (int*)malloc(sizeof(int) * MEGABYTE);
	int * out = (int*)malloc(sizeof(int) * MEGABYTE);

	for (int i = 0; i < MEGABYTE; i++)
		in[i] = i;

	cudaMalloc((void**)&dev_in, sizeof(int) * MEGABYTE);
	cudaMalloc((void**)&dev_out, sizeof(int) * MEGABYTE);

	cudaProfilerStart();

	cudaMemcpy(dev_in, in, sizeof(int) * MEGABYTE, cudaMemcpyHostToDevice);
	
	kernel<<<1, 32>>>(dev_in, dev_out);

	cudaMemcpy(out, dev_out, sizeof(int) * MEGABYTE, cudaMemcpyDeviceToHost);

	cudaDeviceSynchronize();
	cudaProfilerStop();
    
    
	free(in);
	free(out);
	cudaFree(dev_in);
	cudaFree(dev_out);
	cudaDeviceReset();

    return 0;
}

In this code I load 4 MB into global memory, and in the kernel I read the first 128 bytes. The L2 hit rate is 0%, as shown in these figures: https://www.dropbox.com/s/o7hgn6y21nbg8jn/Overview_Misses.PNG?dl=0
https://www.dropbox.com/s/mquivv9expyy9ul/Caches_Miss_1.PNG?dl=0
https://www.dropbox.com/s/gt32efmpbye0mzv/Buffers_Missis.PNG?dl=0

Then I edited the code to read the last 128 bytes instead:

out[(MEGABYTE - 32) + threadIdx.x] = in[(MEGABYTE - 32) + threadIdx.x];

Now the L2 hit rate is 100%, as shown in these figures:
https://www.dropbox.com/s/ayot3nllw4pubum/Overview_No_Misses.PNG?dl=0
https://www.dropbox.com/s/6i1kk6nkebi4v8y/Caches_No_Miss_1.PNG?dl=0
https://www.dropbox.com/s/ietyewcei63cbyr/Buffers_No_Missis.PNG?dl=0

This experiment seems to confirm what has been said above: the copy is 4 MB but the L2 is only 2 MB, so by the time the copy finishes the beginning of the buffer has presumably been evicted while the end is still resident in L2.

I then did another experiment where I deleted the two cudaMemcpy() calls and repeated the two cases. How is it possible that I do not execute any cudaMemcpy() but still get a 100% L2 hit rate? And what are “Buffers”? There are some loads from “device” and “system” that I cannot explain. The figures below show these results:

https://www.dropbox.com/s/9om1djpdjk0z6cj/Overview_Nocpy_NoMisses.PNG?dl=0
https://www.dropbox.com/s/uffhab7wi582kuy/Buffers_Nocpy_NoMisses.PNG?dl=0

Thanks for any answers.

I’m not entirely sure of the circumstances under which the L2 cache lines are invalidated. You may be assuming that they are invalidated at application start, but I’m not sure that is the case. If you think carefully about GPU concurrency scenarios, you may agree with me.

I see no reason why the hardware would have to invalidate the L2 cache lines if the underlying data has not been touched.

Therefore I think your results with no cudaMemcpy are plausible.

To maximize the likelihood of the L2 cache being in an invalid state, I would:

  1. Run an app that allocates “all” of available GPU memory, then writes to all of it, sequentially (a rough sketch of such a “scrubber” appears after this list). This will ensure that only the last few megabytes are “in” the L2 cache.

  2. Then run your app. The allocations there are presumably not likely to end up at the “end” of the previously allocated region from step 1, therefore L2 cache should be invalid for that region.
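
As a rough sketch of step 1 (the allocation headroom, launch configuration, and names are illustrative assumptions, and error checking is omitted), the “scrubber” could look something like this:

#include "cuda_runtime.h"
#include <stdio.h>

// Touch (nearly) all of free device memory so that whatever was in the
// L2 cache beforehand gets evicted by these writes.
__global__ void scrub(int *p, size_t n)
{
	for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x * blockDim.x)
		p[i] = 0;
}

int main()
{
	size_t free_bytes = 0, total_bytes = 0;
	cudaMemGetInfo(&free_bytes, &total_bytes);

	// Leave some headroom so the allocation succeeds.
	size_t alloc_bytes = free_bytes - (size_t)64 * 1048576;
	int *p = 0;
	cudaMalloc((void **)&p, alloc_bytes);

	scrub<<<256, 256>>>(p, alloc_bytes / sizeof(int));
	cudaDeviceSynchronize();

	cudaFree(p);
	printf("scrubbed %zu bytes\n", alloc_bytes);
	return 0;
}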

I did this experiment: I shut down my PC, then launched the application without the cudaMemcpy() calls. I accessed the last 2 MB of global memory and I still got a 100% L2 hit rate!

That seems very strange to me.

Maybe cudaMalloc() also changes something in the cache.

Why is the data set size limited to 1 megabyte?

When using a kernel to test the latency of the L2 cache, I found a weird phenomenon.
I create an array with ARRAY_SIZE float elements and test the time to access one element of the array, using two ways of initializing it:

  1. Initialize the array outside the kernel and use cudaMemcpy() to move the data to DRAM.
  2. Initialize the array inside a kernel using __stcg().

But when ARRAY_SIZE is over 1 megabyte (e.g. 2 MB), the latency differs between the two ways:
way 1: for every 8 elements, 1 element with high latency (>400 cycles) follows 7 elements with low latency (about 200 cycles) - this looks like a miss in L2
way 2: accessing every element takes about 200 cycles - this looks like a hit in L2
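
For reference, here is a rough sketch of “way 2” plus the kind of timed access used to measure latency (the array size, launch configuration, and timing kernel are illustrative assumptions, not the original test code):

#include "cuda_runtime.h"
#include <stdio.h>

#define ARRAY_SIZE (2 * 1024 * 1024 / sizeof(float))   // 2 MB of floats

// Way 2: initialize the array from inside a kernel.
// __stcg() stores with the .cg hint: it bypasses L1 and caches the line in L2,
// so after this kernel the data should still be resident in L2 (capacity permitting).
__global__ void init_in_kernel(float *data, size_t n)
{
	for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x * blockDim.x)
		__stcg(&data[i], (float)i);
}

// Roughly time a single load with clock64() (one thread, one element).
__global__ void time_load(const float *data, size_t idx, long long *cycles, float *sink)
{
	long long t0 = clock64();
	float v = __ldcg(&data[idx]);   // .cg load: served from L2 (or DRAM), bypassing L1
	*sink = v;                      // consume the value so the load is not optimized away
	long long t1 = clock64();
	*cycles = t1 - t0;              // rough estimate; includes some instruction overhead
}

int main()
{
	float *d = 0, *sink = 0;
	long long *cycles = 0;
	cudaMalloc((void **)&d, ARRAY_SIZE * sizeof(float));
	cudaMalloc((void **)&sink, sizeof(float));
	cudaMalloc((void **)&cycles, sizeof(long long));

	init_in_kernel<<<64, 256>>>(d, ARRAY_SIZE);   // way 2: data should end up in L2
	cudaDeviceSynchronize();

	time_load<<<1, 1>>>(d, 0, cycles, sink);
	long long c = 0;
	cudaMemcpy(&c, cycles, sizeof(long long), cudaMemcpyDeviceToHost);
	printf("load latency: %lld cycles\n", c);

	cudaFree(d);
	cudaFree(sink);
	cudaFree(cycles);
	return 0;
}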

My question is: is there any limit on how much of the L2 cache cudaMemcpy() can use?

I’m not aware of any limit.