Wanted to know about setting aside L2 cache memory

Hi all, I want to understand why there is a need to set aside a portion of the L2 cache for accesses to a global memory region.
I also want to know why this matters particularly when we use CUDA streams and CUDA graphs.
Please refer to section 3.2.3.1, "L2 Cache Set-Aside for Persisting Accesses", and the subsequent sections of the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Thanks and Regards

Nagaraj Trivedi

Hi,

You can customize L2 cache usage to a certain degree based on your use case.
For example, data that is read only once and data that is read multiple times can use different caching strategies.

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)

When a kernel subsequently executes in the CUDA stream, memory accesses within the global memory extent [ptr..ptr+num_bytes) are more likely to persist in the L2 cache than accesses to other global memory locations.
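Putting the pieces together, here is a minimal host-side sketch, assuming a valid `stream`, a device pointer `ptr`, and a size `num_bytes` already exist. It reserves the set-aside portion of L2, attaches the access policy window to the stream, and later resets everything:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Assumed to exist already: cudaStream_t stream; void* ptr; size_t num_bytes;

// 1. Reserve part of L2 for persisting accesses (the runtime clamps the
//    request to cudaDeviceProp::persistingL2CacheMaxSize).
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t set_aside = std::min(num_bytes, (size_t)prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside);

// 2. Describe the window of global memory that should persist in L2
//    and attach it to the stream.
cudaStreamAttrValue stream_attribute = {};
stream_attribute.accessPolicyWindow.base_ptr  = ptr;
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;
stream_attribute.accessPolicyWindow.hitRatio  = 1.0f;  // whole window gets hitProp
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

// 3. ... launch kernels into `stream`; their accesses to
//    [ptr, ptr + num_bytes) are biased to stay resident in L2 ...

// 4. When done, shrink the window to zero and flush persisting lines so
//    the set-aside region behaves normally again.
stream_attribute.accessPolicyWindow.num_bytes = 0;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
cudaCtxResetPersistingL2Cache();
```

Steps 1 and 4 are not in the snippet above but are needed for the window to have any effect and to release the set-aside cleanly afterwards.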

Below is the extra info we query from cudaGetDeviceProperties() for your reference:

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    8.7
  ...
  l2CacheSize/persistingL2CacheMaxSize/accessPolicyMaxWindowSize:   4194304 / 3145728 / 134213632
...

Thanks.

Hi, I read this information in the document, but my questions are different.

  1. Does it provide faster access to global memory when its contents are present/persist in the L2 cache?
  2. How is it useful w.r.t. CUDA graphs?
  3. Also, let me know the difference between these two statements, explained with a practical example. Assume that the kernels that are part of the stream also access global variables.
    stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting; // Type of access property on cache hit
    stream_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming; // Type of access property on cache miss.

Please clarify.

Thanks and Regards

Nagaraj Trivedi

Hi,

1. Yes. The L2 cache is faster than global memory, so reads that hit in L2 avoid a trip to DRAM.
2. If certain data used by the CUDA graph will be read multiple times, keeping it persistent in the L2 cache can reduce the latency.
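With CUDA graphs, the same access policy window can be attached to an individual kernel node rather than a whole stream, so only that kernel's accesses are biased. A sketch, assuming an existing graph kernel node `node` plus `ptr`/`num_bytes`:

```cuda
#include <cuda_runtime.h>

// Assumed to exist already: cudaGraphNode_t node; void* ptr; size_t num_bytes;

cudaKernelNodeAttrValue node_attribute = {};
node_attribute.accessPolicyWindow.base_ptr  = ptr;
node_attribute.accessPolicyWindow.num_bytes = num_bytes;
node_attribute.accessPolicyWindow.hitRatio  = 0.6f;  // 60% of accesses get hitProp
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;

// Only this kernel node's accesses to [ptr, ptr + num_bytes) are biased
// to persist in L2; other nodes in the graph are unaffected.
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);
```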

3. For example:

    node_attribute.accessPolicyWindow.hitRatio = 0.6;                        

This indicates that a random 60% of the memory accesses in the window [ptr…ptr+num_bytes) receive the persisting property, and the remaining 40% receive the streaming property.
Persisting means the data is expected to be read multiple times and is preferentially kept in L2; streaming means the data is expected to be accessed only once, so it should not displace persisting lines.
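One practical use of a hitRatio below 1.0: if the window is larger than the persisting set-aside, marking every access persisting can make the reserved lines evict each other (thrashing). One heuristic is to scale hitRatio so the expected persisting footprint fits the set-aside. A hedged sketch (the clamping logic is an illustration; `node_attribute` and `num_bytes` as above):

```cuda
#include <cuda_runtime.h>

// Assumed to exist already: cudaKernelNodeAttrValue node_attribute; size_t num_bytes;

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// If hitRatio * num_bytes exceeds the set-aside, persisting lines compete
// for too little cache; scale hitRatio so the expected footprint fits.
float hit_ratio = 1.0f;
if (num_bytes > (size_t)prop.persistingL2CacheMaxSize)
    hit_ratio = (float)prop.persistingL2CacheMaxSize / (float)num_bytes;
node_attribute.accessPolicyWindow.hitRatio = hit_ratio;
```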

Thanks.