Shared memory size per Thread Block

Document explains that cc 7.0 can support 96KB shared memory per thread block.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

But my simple test code could not invoke a kernel with 49KB shared memory. (It’s OK up to 48KB)
Test was done on Titan V and the output of cuobjdump looks good.(arch = sm_70)
What’s the problem?

#include <stdio.h>

__global__ void
test()
{
}

int
main(void)
{
        dim3 dimBlock(1,1);
        dim3 dimGrid(1,1);
        cudaError       err;

        test<<<dimGrid, dimBlock, 49 * 1024>>>();

        err = cudaGetLastError();
        if (err != cudaSuccess) {
                printf("error: %s\n", cudaGetErrorString(err));
        }
}

Following is cuobjdump output for my test executable.

Fatbin ptx code:
================
arch = sm_70
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

Please read the documentation:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x

Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(test, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
4 Likes

Great thanks, crovella.

I’ve checked that over 48KB shared memory is assigned to a TB after cudaFuncSetAttribute() has been applied as you pointed.