L2 cache allocation

Hello everyone,

My goal is to fill the GPU’s L2 cache with an array.

Is there a way to create a variable in a specific hardware memory zone of my GPU with CUDA?

Does anyone have any suggestions, or documents where I can find more information?

You can be somewhat certain of values residing in:

  • shared memory (__shared__)
  • constant memory (__constant__)
  • registers (if nothing spills to local memory)
  • texture memory (will depend on the cache size of the specific hardware)
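
For illustration, a minimal sketch of where these qualifiers place data (the kernel, names, and sizes here are made up for the example):

```
__constant__ float lut[256];        // constant memory, served through the constant cache

__global__ void example(const float* in, float* out)
{
    __shared__ float tile[256];     // shared memory, on-chip, per block (launch with <= 256 threads)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc;                      // local scalar, normally held in a register

    tile[threadIdx.x] = in[i];      // global load; goes through L2 (and possibly L1)
    __syncthreads();

    acc = tile[threadIdx.x] + lut[threadIdx.x];
    out[i] = acc;
}
```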

In the case of L2, you can expect data that you have accessed in a certain way to reside there, but I don’t think you will actually know without extensive profiling.

Use the PG for reference: Programming Guide :: CUDA Toolkit Documentation

Thank you for your response.

I forgot to mention that I’ve read the PG, and that I know I can allocate data in the different memory zones (global, constant, shared…).

From what I understood of the PG, if I want to fill the L2 cache, I need to allocate my array in global memory.
I also know the exact size of the L2 memory on my GPU.
My problem is that when I try to determine whether an array is large enough to fill the L2, by timing a simple program that writes/reads each cell of the array, I find no significant difference in execution time between array sizes (smaller than, equal to, or bigger than my cache size).
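
A stripped-down sketch of the kind of measurement I mean (the kernel and sizes here are simplified placeholders, not my actual code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Touch every element once, sequentially across threads.
__global__ void touch(float* a, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main()
{
    const size_t n = 1 << 22;                 // placeholder; I vary this around the L2 size
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int rep = 0; rep < 100; ++rep)       // repeat so timer resolution matters less
        touch<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu floats: %.3f ms\n", n, ms);

    cudaFree(d);
    return 0;
}
```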

Do you have any ideas that could help me?

Hi @bibsys

You understood the allocation part well: the array goes in global memory. The idea of global memory is to let your SMs (or blocks) communicate through a common memory space. In principle, global memory is the main memory of the GPU, but to give fast access to the data, accesses to main memory are cached in the L2.

Lower bandwidth than expected may be caused by:

  • Non-sequential memory accesses
  • Strided memory accesses
  • Resource starvation

In the first case: if you have contiguous data, let’s say an array of doubles, and you access it sequentially, that is the ideal case. You can also have sequential but strided accesses, which reduce the number of useful elements held in the cache; see the sketch below.
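
To make the contrast concrete, here is a hypothetical pair of kernels (illustrative only, not from any real code):

```
// Coalesced: consecutive threads read consecutive doubles, so every
// byte of each cache line fetched into L2 is actually used.
__global__ void sequential_copy(const double* in, double* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so each
// cache line fetched contributes only one useful double and the L2
// fills up with data that is never touched.
__global__ void strided_copy(const double* in, double* out, size_t n, size_t stride)
{
    size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```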

Considering the case where your whole array fits in the cache, it can happen that many SMs query the L2 at once, leading to a resource starvation condition. So basically, if they are all trying to access the same resource, some SMs need to wait to get access to the L2.

This part of the documentation is fundamental: Programming Guide :: CUDA Toolkit Documentation

You also need to take the cache line into account, which will guide you toward exploiting the cache in a better way, by also making use of the L1. Now, if your goal is to store an array in the L2 purely from code, as far as I know there is no way to do it, other than by tweaking sizes and access patterns.
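
Since you mentioned knowing your exact L2 size: the runtime reports it directly, which helps when tweaking sizes. A minimal query (assuming device 0):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    printf("L2 cache size: %d bytes\n", prop.l2CacheSize);

    // Rough count of doubles that would nominally fill the L2
    // (ignores other traffic and cache-line granularity).
    printf("doubles to fill L2: %d\n", prop.l2CacheSize / (int)sizeof(double));
    return 0;
}
```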

Regards,
Leon


Hi Leon,

Thank you for this information, it really helped!
I’ve managed to find the results I was looking for.
