With CUDA 10.1 + PyCUDA and an RTX 2060 Super, I'm trying to opt in to the additional shared memory available per thread block, i.e. overcome the default 48 KB limit and get access to 64 KB of shared memory.
I understand that this can be done using the Runtime API.
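For context, here is my understanding of the Runtime API route (a sketch only; the kernel body and launch configuration are placeholders I made up, but `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` is the CUDA 9+ opt-in mechanism, and it applies to dynamically allocated shared memory whose size is passed at launch):

```cuda
#include <cuda_runtime.h>

// Kernel using dynamically allocated shared memory; the byte count is
// supplied as the third launch-configuration parameter, not hard-coded.
__global__ void kernel(uint *out)
{
    extern __shared__ uint smem[];
    smem[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    const int smem_bytes = 64 * 1024;  // 64 KB, beyond the default 48 KB

    // Opt this specific kernel in to the larger dynamic shared memory cap.
    cudaFuncSetAttribute(kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);

    uint *out;
    cudaMalloc(&out, 256 * sizeof(uint));
    kernel<<<1, 256, smem_bytes>>>(out);  // dynamic size requested at launch
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```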
But so far (in my relative inexperience) I have not been able to work out how it can be done:
A) with the Driver API, or
B) using PyCUDA (which is built on the Driver API).
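For reference, the Driver API analogue appears to be `cuFuncSetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` (available since CUDA 9). A sketch, where the function name `my_kernel` and the assumption that a context and module already exist are mine:

```cuda
#include <cuda.h>

// Sketch only: assumes a current context and a loaded CUmodule.
void enable_64k_dynamic_smem(CUmodule module)
{
    CUfunction func;
    cuModuleGetFunction(&func, module, "my_kernel");

    // Raise this kernel's cap on dynamic shared memory to 64 KB.
    cuFuncSetAttribute(func,
                       CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                       64 * 1024);

    // The 64 KB must then also be requested via the sharedMemBytes
    // argument of cuLaunchKernel when the kernel is launched.
}
```

If I understand correctly, recent PyCUDA releases expose this same call as `Function.set_attribute` with `pycuda.driver.function_attribute.MAX_DYNAMIC_SHARED_SIZE_BYTES`, but I am not certain which versions wrap it, so please correct me if that is wrong.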
To try to be clearer about how I am allocating the shared memory: this snippet of kernel code (a statically declared 48 KB array) compiles fine:
__shared__ uint smem[12288];
…but this does not:
__shared__ uint smem[12289];
(this being 4 bytes beyond 48 KB)
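As far as I can tell, that 48 KB cap on statically declared `__shared__` arrays is fixed, and the opt-in to the larger carve-out applies only to dynamically allocated shared memory, i.e. a declaration of this form, with the byte count supplied at launch rather than in the source:

```cuda
// Dynamic form: no size in the declaration; the byte count comes from
// the third <<<grid, block, smem_bytes>>> launch parameter (runtime API)
// or the sharedMemBytes argument of cuLaunchKernel (driver API).
extern __shared__ uint smem[];
```

Is that the piece I am missing, and if so, how do I express the opt-in through PyCUDA?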