Hi, I recently noticed that there are three APIs for partitioning shared memory and L1:
__global__ void ptr_kernel() {
//...
}
// Per-function: preferred shared-memory carveout, as a percentage of the combined L1/shared capacity
cudaFuncSetAttribute(ptr_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared);
// Per-function: cache-preference hint
cudaFuncSetCacheConfig(ptr_kernel, cudaFuncCachePreferShared);
// Device-wide: cache-preference hint
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
I know that cudaDeviceSetCacheConfig sets the partition at the device level, while the other two apply to a specific function and can override the global cache setting when that function is launched. But it seems that cudaFuncSetAttribute and cudaFuncSetCacheConfig have the same functionality.
Another question: do these APIs imply device synchronization, i.e., do they have to wait for all in-flight kernels to finish before changing the cache setting?
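To make the override I have in mind concrete, here is a minimal sketch (the kernel and launch configuration are just placeholders): a device-wide preference is set first, then a per-function preference that should take priority for that one kernel:

#include <cuda_runtime.h>

__global__ void ptr_kernel() {
//...
}

int main() {
    // Device-wide default: prefer a larger L1 for all kernels...
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // ...except this kernel, which prefers a larger shared memory partition.
    cudaFuncSetCacheConfig(ptr_kernel, cudaFuncCachePreferShared);
    ptr_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}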
The cudaFuncCachePreferShared setting is/was applicable to devices that have a hardware resource combining shared memory and L1 cache. From Maxwell through Pascal, this was not the case.
Here is my understanding of the progression:
- In the Fermi generation, for example, the L1 cache and the shared memory were part of the same hardware block. The texture unit was separate.
- After Fermi, through Pascal, the L1 and texture units were combined into the same hardware resource, and shared memory was a separate entity.
- In the Volta timeframe, all three units were combined into a single hardware resource: L1, texture, and shared memory.
AFAIK, the cudaFuncCachePreferShared setting was usable in the Fermi generation. I’m not sure it has any applicability anymore.
The carveout setting did not exist in the Fermi era, and is applicable for Volta and beyond. cudaFuncAttributePreferredSharedMemoryCarveout has the same applicability.
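As a minimal sketch of the carveout path on Volta and beyond (the kernel name here is just a placeholder, and the requested percentage is only a hint that the driver may adjust):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void carveout_kernel() {
//...
}

int main() {
    // Hint: dedicate roughly 50% of the combined L1/shared capacity to shared memory.
    cudaError_t err = cudaFuncSetAttribute(
        carveout_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    if (err != cudaSuccess) {
        printf("cudaFuncSetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // The recorded preference can be read back via cudaFuncGetAttributes.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, carveout_kernel);
    printf("preferred carveout: %d%%\n", attr.preferredShmemCarveout);
    return 0;
}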
Thank you. But the NVIDIA Ampere GPU Architecture Tuning Guide, Section 1.4.2.3, says:
In the NVIDIA Ampere GPU architecture, the portion of the L1 cache dedicated to shared memory (known as the carveout) can be selected at runtime as in previous architectures such as Volta, using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout. The NVIDIA A100 GPU supports shared memory capacity of 0, 8, 16, 32, 64, 100, 132 or 164 KB per SM. GPUs with compute capability 8.6 support shared memory capacity of 0, 8, 16, 32, 64 or 100 KB per SM.
If we can allocate sizes for shared memory and L1, do shared memory and L1 share the same on-chip storage? Or is there some other hardware mechanism?
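For reference, a minimal sketch that queries these capacities at runtime instead of hard-coding the table above (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int smem_per_sm = 0, smem_optin = 0;
    // Total shared memory available per SM (the upper bound on the carveout).
    cudaDeviceGetAttribute(&smem_per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    // Per-block maximum available when opting in via cudaFuncAttributeMaxDynamicSharedMemorySize.
    cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    printf("shared memory per SM: %d KB, opt-in max per block: %d KB\n",
           smem_per_sm / 1024, smem_optin / 1024);
    return 0;
}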
I’ve modified my previous response. Yes, for Volta and beyond, the L1 and shared memory are combined. This is mentioned in the Volta white paper.