How to allocate more than 48 KB of shared memory per block on A100?

Hi all,
I know that static shared memory allocation is limited to 48 KB per block, but the A100 has 164 KB of shared memory per SM. I tried using dynamic allocation to get more than 48 KB. Compilation is fine, but the kernel launch fails with "CUDA error: invalid argument" as soon as the block needs more than 48 KB of shared memory. What is the proper way to do this allocation?
Thank you

The relevant CUDA code looks like this:

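#include <cstdio>

__global__ void sharedMemTest()
{
    __shared__ int _ss[1024];      // static shared memory: 1024 * 4 bytes = 4 KB
    extern __shared__ int _s[];    // dynamic shared memory, sized at launch
    if (threadIdx.x == 0)
        // %p with void* casts is the portable way to print pointers
        printf("blockIdx.x is %d, s is at %p, ss is at %p\n",
               blockIdx.x, (void*)(_s + 10), (void*)_ss);
}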

int main()
{
    dim3 block(32);
    dim3 grid(32);
    // 44 KB + 1 byte dynamic, plus the 4 KB static array, is 1 byte over 48 KB
    sharedMemTest<<<grid, block, 44 * 1024 + 1>>>();
    
    cudaError_t error = cudaGetLastError();
    printf("CUDA error: %s\n", cudaGetErrorString(error));
    cudaDeviceSynchronize();
}

Problem solved. Thank you very much
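
For anyone who finds this thread later: the post above does not say what the fix was, so the following is only a minimal sketch of the usual route, assuming the standard opt-in through cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize. A launch that needs more than 48 KB of shared memory per block is rejected unless the kernel has opted in first. The helper name launchLarge, the kernel body, and the 100 KB request are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: fills its dynamic shared memory so the allocation is used.
__global__ void sharedMemTest(int n)
{
    extern __shared__ int s[];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s[i] = i;
    __syncthreads();
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("last element: %d\n", s[n - 1]);
}

// Hypothetical helper: opt in to smemBytes of dynamic shared memory, then launch.
cudaError_t launchLarge(int smemBytes)
{
    // Check the request against the per-block opt-in limit of device 0.
    int maxOptin = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &maxOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    if (err != cudaSuccess) return err;
    if (smemBytes > maxOptin) return cudaErrorInvalidValue;

    // Raise this kernel's dynamic shared memory cap above the 48 KB default.
    err = cudaFuncSetAttribute(sharedMemTest,
                               cudaFuncAttributeMaxDynamicSharedMemorySize,
                               smemBytes);
    if (err != cudaSuccess) return err;

    sharedMemTest<<<32, 32, smemBytes>>>(smemBytes / (int)sizeof(int));
    err = cudaGetLastError();              // launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors raised while the kernel runs
}

int main()
{
    // Assumed request size: 100 KB per block.
    cudaError_t err = launchLarge(100 * 1024);
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Static __shared__ arrays still count toward the per-block total, so the dynamic request plus any static allocation has to fit under the opt-in limit (163 KB per block on A100, as far as I know). Larger blocks also reduce how many blocks can be resident on one SM, so occupancy is worth checking.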