Shared memory size per Thread Block

The documentation says that compute capability 7.0 devices support up to 96 KB of shared memory per thread block.

But my simple test program cannot launch a kernel with 49 KB of dynamic shared memory (it works up to 48 KB).
The test was done on a Titan V, and the cuobjdump output looks correct (arch = sm_70).
What's the problem?

#include <stdio.h>

__global__ void test()
{
        // placeholder body: touch the dynamically sized shared memory
        extern __shared__ char buf[];
        buf[0] = 0;
}

int main()
{
        dim3 dimBlock(1,1);
        dim3 dimGrid(1,1);
        cudaError_t err;

        test<<<dimGrid, dimBlock, 49 * 1024>>>();

        err = cudaGetLastError();
        if (err != cudaSuccess) {
                printf("error: %s\n", cudaGetErrorString(err));
        }
        return 0;
}

The following is the cuobjdump output for my test executable.

Fatbin ptx code:
arch = sm_70
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit

Please read the documentation:

Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(test, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
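Putting the two pieces together, here is a minimal sketch that opts in and then launches with 49 KB of dynamic shared memory (it assumes a trivial `test` kernel that declares an `extern __shared__` array; any kernel using dynamic shared memory works the same way):

```cuda
#include <stdio.h>

__global__ void test()
{
        extern __shared__ char buf[];   // dynamically sized shared memory
        buf[0] = 0;
}

int main()
{
        // Opt in to more than 48 KB of dynamic shared memory per block
        int maxbytes = 98304; // 96 KB, the Volta per-block limit
        cudaError_t err = cudaFuncSetAttribute(
                test, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
        if (err != cudaSuccess)
                printf("attribute error: %s\n", cudaGetErrorString(err));

        // This 49 KB launch now succeeds on a cc 7.0 device
        test<<<dim3(1,1), dim3(1,1), 49 * 1024>>>();
        err = cudaGetLastError();
        if (err != cudaSuccess)
                printf("launch error: %s\n", cudaGetErrorString(err));
        cudaDeviceSynchronize();
        return 0;
}
```

Note that the opt-in is per-kernel: each kernel that needs more than 48 KB must have the attribute set on it individually.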

Many thanks, crovella.

I've confirmed that more than 48 KB of shared memory is assigned to a thread block after applying cudaFuncSetAttribute() as you pointed out.
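For anyone who wants to verify the opt-in without inspecting the launch itself, a small sketch (again assuming a trivial `test` kernel with dynamic shared memory) can read the attribute back with cudaFuncGetAttributes():

```cuda
#include <stdio.h>

__global__ void test()
{
        extern __shared__ char buf[];
        buf[0] = 0;
}

int main()
{
        // Opt in to the full 96 KB available on Volta
        cudaFuncSetAttribute(test, cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);

        // Read the per-kernel attributes back to confirm the new limit
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, test);
        printf("maxDynamicSharedSizeBytes = %d\n", attr.maxDynamicSharedSizeBytes);
        return 0;
}
```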