APIs for splitting shared memory and L1 cache

Hi, I recently noticed that there are 3 APIs for the partition of shared memory and L1:

__global__ void ptr_kernel() {
    //...
}

cudaFuncSetAttribute(ptr_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared);
cudaFuncSetCacheConfig(ptr_kernel, cudaFuncCachePreferShared);
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

I know that cudaDeviceSetCacheConfig is used for device level partition, while the other two are used for a specific function that could override the global cache setting when being launched. But it seems that the cudaFuncSetAttribute and cudaFuncSetCacheConfig have the same functionality.

Another question is: does these APIs imply device synchronization, i.e they need to wait for all kernels done before changing the cache setting?

The cudaFuncCachePreferShared is/was applicable to devices that had a hardware resource that combined shared memory and L1 cache. Since at least Maxwell, this has not been true.

Here is my understanding of the progression:

  • In eg. Fermi generation, the L1 cache and the shared memory were part of the same hardware block. The texture unit was separate.
  • After Fermi, through Pascal, the L1 and Tex units were combined into the same hardware resource, and shared was a separate entity.
  • In the Volta timeframe, all 3 units were combined into a single hardware resource: L1/Tex and share.

AFAIK, the cudaFuncCachePreferShared setting was usable in the Fermi generation. I’m not sure it has any applicability any more.

The carveout setting did not exist in the fermi era, and is applicable for volta and beyond.

cudaFuncAttributePreferredSharedMemoryCarveout has the same applicability.

1 Like

Thank you. But in NVIDIA Ampere GPU Architecture Tuning Guide Section 1.4.2.3 it says that

In the NVIDIA Ampere GPU architecture, the portion of the L1 cache dedicated to shared memory (known as the carveout ) can be selected at runtime as in previous architectures such as Volta, using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout . The NVIDIA A100 GPU supports shared memory capacity of 0, 8, 16, 32, 64, 100, 132 or 164 KB per SM. GPUs with compute capability 8.6 support shared memory capacity of 0, 8, 16, 32, 64 or 100 KB per SM.

If we can allocate size for shared memory and L1, do the shared memory and L1 shared the same chip? Or there exist some other hardware mechanisms?

I’ve modified my previous response. Yes, for volta and beyond, the L1 and shared are combined. This is mentioned in the volta white paper.

1 Like