Actual L1 size in Volta and Turing

I want to use the L1 cache size of a Titan V (Volta) device in my program, but I am confused about the actual size. According to [1]:

In Volta the L1 cache, texture cache, and shared memory are backed by a combined 128 KB data cache.

Volta increases the maximum capacity of the L1 cache to 128 KB

So should I use the total 128KB size or exclude texture and shared capacities? Shared size is also variable.
On the other hand, TechPowerup [2] lists the L1 size as 96KB.

The same question applies to the 2080 Ti (Turing). According to [3]:

The total size of the unified L1 / Shared Memory cache in Turing is 96 KB.

Turing supports two carveout configurations, either with 64 KB of shared memory and 32 KB of L1, or with 32 KB of shared memory and 64 KB of L1.

But TechPowerup says it is 64KB [4].

What are the actual sizes?

[1] https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#l1-cache
[2] https://www.techpowerup.com/gpu-specs/titan-v.c3051
[3] https://docs.nvidia.com/cuda/turing-tuning-guide/index.html#l1-cache
[4] https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3305

One of the best resources for this kind of information is the CUDA Programming Guide, along with the architecture tuning guides.

Check out https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#l1-cache and https://docs.nvidia.com/cuda/turing-tuning-guide/index.html#l1-cache.

According to the documentation, on Volta a single thread block can use up to 96KB of shared memory, and the SM supports shared memory capacities of 0, 8, 16, 32, 64, or 96 KB.

On Turing, a single thread block can use up to 64KB of shared memory, and the SM supports two carveout configurations: either 64 KB of shared memory and 32 KB of L1, or 32 KB of shared memory and 64 KB of L1.

In short, both architectures allow the programmer to change the split at runtime. If a kernel isn’t using any shared memory, it might behoove the programmer to allot more space to L1 to reduce cache misses.

How do you set the shared memory amount?
The portion of the cache dedicated to shared memory (known as the carveout) can be selected at runtime using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout.
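As a rough illustration of how the carveout value relates to the Volta capacities listed above, here is a host-side C++ sketch. It assumes the behaviour described in the CUDA C++ Programming Guide: the value is an integer percentage of the 96 KB maximum, and a percentage that does not map exactly to a supported capacity is rounded up to the next larger one. The helper names (carveoutPercentFor, expectedSharedKB) are hypothetical and not part of the CUDA API, and the carveout is only a preference, so the driver may still choose differently.

```cpp
#include <array>
#include <cassert>

// Shared memory capacities per SM on Volta, in KB (from the tuning guide).
constexpr std::array<int, 6> kVoltaSharedKB = {0, 8, 16, 32, 64, 96};
constexpr int kVoltaMaxSharedKB = 96;

// Hypothetical helper: an integer percentage of the 96 KB maximum that should
// land on `wantedKB`. It rounds down so the request never exceeds the target.
int carveoutPercentFor(int wantedKB) {
    assert(wantedKB >= 0 && wantedKB <= kVoltaMaxSharedKB);
    return wantedKB * 100 / kVoltaMaxSharedKB;
}

// Hypothetical helper: the capacity one would expect for a percentage hint,
// ASSUMING the driver rounds up to the next supported capacity.
int expectedSharedKB(int percent) {
    assert(percent >= 0 && percent <= 100);
    for (int kb : kVoltaSharedKB) {
        // Compare kb against percent% of 96 KB without floating point.
        if (kb * 100 >= percent * kVoltaMaxSharedKB) return kb;
    }
    return kVoltaMaxSharedKB;
}

// In a CUDA program the percentage would then be passed as the third argument:
//   cudaFuncSetAttribute(MyKernel,
//                        cudaFuncAttributePreferredSharedMemoryCarveout,
//                        percent);
```

Under these assumptions, carveoutPercentFor(64) gives 66, and a hint of 50 (48 KB, which is not a supported capacity) is expected to be rounded up to 64 KB.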

Check out https://devblogs.nvidia.com/using-shared-memory-cuda-cc/

Remember that shared memory is a limited resource: if a single block uses all of the available shared memory, only one block can be resident on each SM at a time, and your kernel might run suboptimally.

You can use Nsight Compute to profile performance.

Thanks. Actually, I want to do some cache capacity analyses, and my questions still stand…

According to the documentation, on Volta a single thread block can use up to 96KB of shared memory, and the SM supports shared memory capacities of 0, 8, 16, 32, 64, or 96 KB.

Do you mean that I can use 96KB+0KB (shared) or 96KB+96KB (shared)? If so, what is the 128KB mentioned in the doc?
If I set the L1 size in my program to 96KB, am I missing some of the cache capacity?

On Volta, the combined capacity of L1 cache and shared memory is 128KB. So if you configure shared memory to its maximum capacity of 96KB, there will be 32KB left over for L1.

At the other extreme, configuring shared memory to 0KB leaves all 128KB available to L1.
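The arithmetic above can be written down as a small host-side C++ sketch. It only encodes the figures quoted in this thread from the tuning guides (128 KB unified on Volta, 96 KB unified on Turing, and the listed shared memory capacities). The names Arch, isSupportedSharedKB and l1KB are hypothetical and chosen for illustration.

```cpp
#include <cassert>

enum class Arch { Volta, Turing };

// Combined L1 + shared memory capacity per SM, in KB (from the tuning guides).
constexpr int unifiedKB(Arch a) {
    return a == Arch::Volta ? 128 : 96;
}

// Shared memory capacities per SM that each architecture is documented to support.
bool isSupportedSharedKB(Arch a, int sharedKB) {
    if (a == Arch::Volta) {
        return sharedKB == 0 || sharedKB == 8 || sharedKB == 16 ||
               sharedKB == 32 || sharedKB == 64 || sharedKB == 96;
    }
    return sharedKB == 32 || sharedKB == 64;  // Turing: two carveout configurations
}

// L1 capacity left over once `sharedKB` is carved out for shared memory.
int l1KB(Arch a, int sharedKB) {
    assert(isSupportedSharedKB(a, sharedKB));
    return unifiedKB(a) - sharedKB;
}
```

For example, l1KB(Arch::Volta, 96) is 32 and l1KB(Arch::Volta, 0) is 128, matching the two extremes described above. For a cache capacity analysis, the L1 size to assume therefore depends on the carveout in effect for the kernel being measured.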

Refer to the section "Enhanced L1 Data Cache and Shared Memory" in the Volta architecture whitepaper: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

If I understand correctly from [1] (slide 74), I have to use

cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

According to [2], the second argument can be cudaFuncAttributeMaxDynamicSharedMemorySize, cudaFuncAttributePreferredSharedMemoryCarveout, or cudaFuncAttributeMax, which correspond to L1(32KB)/SHARED(96KB), L1(64KB)/SHARED(64KB), and L1(128KB)/SHARED(0), respectively.

I don’t know about the third argument. Is it a percentage (0-100) or a fraction (0-1)? If it is a percentage, then these calls should be valid.

Smem / L1 splits
96KB / 32KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 100);

64KB / 64KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 80);

32KB / 96KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 60);

16KB / 112KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 40);

8KB / 120KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 20);

0KB / 128KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMax, 0);

Am I right?

[1] http://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf
[2] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html