Turing - accessing 64KB shared mem from PyCuda (driver api)

With CUDA 10.1 + PyCUDA and an RTX 2060 Super, I’m trying to access the additional available shared memory, i.e. overcome the 48KB-per-threadblock default and obtain access to the full 64KB of shared memory.

I understand that this can be done using the Runtime API.

But so far (in my relative inexperience) I have not been able to work out how it can be done:
A) With the Driver API
B) Using PyCuda (which uses the Driver API)

To try to be clearer, I believe that I am dynamically allocating the shared memory:

This snippet of kernel code (featuring 48KB shared memory) compiles fine:

__shared__ uint smem[12288];

…but this does not:

__shared__ uint smem[12289];

(this being 4 bytes beyond 48KB)

That is not a dynamic allocation of shared memory; it is a static allocation, and static shared memory is limited to 48KB per threadblock at compile time, which is why the second snippet fails to compile.

A dynamic allocation looks like this:

extern __shared__ uint smem[];

and it requires a kernel execution configuration argument to specify how much shared memory (in bytes) should be provided at launch — the optional third argument in the CUDA C++ `<<<...>>>` syntax, which PyCuda exposes as the `shared=` keyword argument when launching a kernel.
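As a minimal PyCuda sketch of that pattern (the kernel and variable names here are illustrative, not from your code), a dynamic allocation within the default 48KB looks like this:

```python
import numpy as np
import pycuda.autoinit          # creates a context on the first available GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
extern __shared__ unsigned int smem[];   // size fixed at launch, not at compile time

__global__ void fill(unsigned int *out, int n)
{
    int i = threadIdx.x;
    if (i < n) {
        smem[i] = i;
        __syncthreads();
        out[i] = smem[i];
    }
}
""")

fill = mod.get_function("fill")
n = 256
out = np.zeros(n, dtype=np.uint32)
# `shared=` is the dynamic shared memory size in bytes for this launch;
# it plays the role of the third <<<...>>> execution configuration argument.
fill(drv.Out(out), np.int32(n), block=(n, 1, 1), grid=(1, 1), shared=n * 4)
```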

You’ll also want to set a function attribute on Volta and Turing to opt in to more than 48KB of dynamic shared memory per threadblock — in the driver API this is `cuFuncSetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` (note that Turing caps out at 64KB per block):

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
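Putting it together from PyCuda, a sketch might look like the following. This assumes a PyCuda version whose `Function.set_attribute` wraps the driver call `cuFuncSetAttribute`; if your build doesn’t expose it, you would need to update PyCuda or make the driver call another way. The kernel and names are illustrative:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
extern __shared__ unsigned int smem[];

__global__ void touch(unsigned int *out)
{
    smem[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}
""")

touch = mod.get_function("touch")

nbytes = 64 * 1024   # above the 48KB default; the per-block cap on Turing
# Opt in to the larger dynamic shared memory size for this kernel
# (assumed wrapper for cuFuncSetAttribute /
#  CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES).
touch.set_attribute(
    drv.function_attribute.MAX_DYNAMIC_SHARED_SIZE_BYTES, nbytes)

out = np.zeros(256, dtype=np.uint32)
touch(drv.Out(out), block=(256, 1, 1), grid=(1, 1), shared=nbytes)
```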

That should be all that is needed.