L1 data cache/shared memory size in Volta architecture

I understand that the L1 data cache and shared memory are combined on cards of the Volta architecture. My question is: if I don't configure the shared memory size explicitly and
I don't use shared memory, will the L1 cache occupy all of the combined (128 KB/SM) memory?
If not, should I configure the shared memory size manually to allow for a larger L1 cache?

Update: I found this quote in the “Inside Volta” blog post:
“The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory.”
Still, I am wondering if I have to make it explicit that I am not using shared memory.

In Volta the L1 cache, texture cache, and shared memory are backed by a combined 128 KB data cache. As in previous architectures such as Kepler, the split between L1 and shared memory is configurable; on Volta the portion of the cache dedicated to shared memory (known as the carveout) can be selected at runtime using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout. Volta supports shared memory capacities of 0, 8, 16, 32, 64, or 96 KB per SM.

You need to explicitly set shared memory capacity to 0.
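As a minimal sketch of what that looks like (the kernel name here is hypothetical; cudaSharedmemCarveoutMaxL1 is the runtime's enum for a 0% shared-memory carveout, and the carveout is only a preference the driver may adjust):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel()  // hypothetical kernel that uses no shared memory
{
}

int main()
{
    // Hint the runtime to devote the whole combined cache to L1.
    // cudaSharedmemCarveoutMaxL1 corresponds to a 0% shared-memory carveout.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);
    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```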

Best developer-forum search experience I've ever had. I was literally just wondering the same thing myself. It was asked 7 hours ago and answered within the last hour. Thanks to both of you.

Follow-on question.

Is it possible to verify/query the partitioning of the combined cache by the L1/texture caches and shared-memory at runtime?

You can, on a per-kernel basis. Note that if you don't set preferredShmemCarveout explicitly, the query will return -1.

#include <cstdio>

template<typename T>
__global__ void addKernel(T *c, const T *a, const T *b)
{
    __shared__ float s_data[1024];
    int i = threadIdx.x;
    c[i] = a[i] + b[i];

    s_data[i] = c[i];

    c[i] = s_data[i];
}

int main()
{
    cudaFuncAttributes attrib;

    int bytes = 1 << 8;  // 256 bytes of dynamic shared memory
    int carveout = 50;   // prefer a 50% shared-memory carveout
    cudaFuncSetAttribute(addKernel<float>, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
    cudaFuncSetAttribute(addKernel<float>, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

    cudaFuncGetAttributes(&attrib, addKernel<float>);

    // https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaFuncAttributes.html#structcudaFuncAttributes_1088f144760311325523b66ee90c83c51
    printf("binaryVersion: %d\n", attrib.binaryVersion);
    printf("cacheModeCA: %d\n", attrib.cacheModeCA);
    printf("constSizeBytes: %lu\n", attrib.constSizeBytes);
    printf("localSizeBytes: %lu\n", attrib.localSizeBytes);
    printf("maxDynamicSharedSizeBytes: %d\n", attrib.maxDynamicSharedSizeBytes);
    printf("maxThreadsPerBlock: %d\n", attrib.maxThreadsPerBlock);
    printf("numRegs: %d\n", attrib.numRegs);
    printf("preferredShmemCarveout: %d\n", attrib.preferredShmemCarveout);
    printf("ptxVersion: %d\n", attrib.ptxVersion);
    printf("sharedSizeBytes: %lu\n", attrib.sharedSizeBytes);

    return 0;
}

binaryVersion: 75
cacheModeCA: 0
constSizeBytes: 0
localSizeBytes: 0
maxDynamicSharedSizeBytes: 256
maxThreadsPerBlock: 1024
numRegs: 12
preferredShmemCarveout: 50
ptxVersion: 75
sharedSizeBytes: 4096
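The carveout attribute is only a per-kernel preference; you can also query the device-level limits it is bounded by. A small sketch, assuming a device is present (the attribute enums are from the runtime API; the variable names are mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    int smemPerSM = 0, smemPerBlockOptin = 0;

    // Total shared memory available per multiprocessor (the carveout upper bound).
    cudaDeviceGetAttribute(&smemPerSM, cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev);
    // Maximum dynamic shared memory a single block may opt in to.
    cudaDeviceGetAttribute(&smemPerBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);

    printf("shared memory per SM: %d bytes\n", smemPerSM);
    printf("max opt-in shared memory per block: %d bytes\n", smemPerBlockOptin);
    return 0;
}
```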

Thanks for the quick and detailed replies!
I checked sharedSizeBytes and confirmed that it is zero for my kernel (when no shared memory is used), both with and without setting the carveout ratio explicitly.
Thanks again!