Just to clarify: compute capability 1.x devices have 16KB (presumably KiB, as in 2^10 bytes, not the SI unit kB, as in 10^3 bytes) of shared memory per MP…
But in practice it’s impossible to allocate the full 16KB of shared memory, because the formal parameter list will (always?) take up some of that shared memory…?
Is it possible to have the formal parameters shoved into a mix of constant memory / registers (e.g. using raw PTX)? Or is it simply impossible to allocate a full 16KB of shared memory for full use by the kernel?
I think there was another thread on this just recently. I believe the official word was that, due to the parameters and some other overhead, you aren’t able to allocate the whole 16K.
Though I imagine that if you used PTX, you could probably read the parameters into registers at the beginning of the kernel, then overwrite the whole shared memory space with whatever you wanted. I don’t know what that would really get you, though. The parameters can’t be that big anyway, and overwriting them probably wouldn’t even get you an extra byte per thread.
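If you want to see how much of the per-block limit a kernel already uses, something like the sketch below might help. It only uses the runtime calls cudaGetDeviceProperties and cudaFuncGetAttributes; the kernel scaleKernel and the helper sharedBytesLeft are made up for illustration. I'd expect the parameter overhead to show up in the reported figure on 1.x hardware, but I haven't verified that, so treat the output as a starting point rather than the official answer. (On later architectures the parameters are passed in constant memory instead, so the question mostly goes away there.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to inspect.
// It assumes blockDim.x <= 256 because of the fixed-size tile.
__global__ void scaleKernel(float *data, float factor, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * factor;
        data[i] = tile[threadIdx.x];
    }
}

// Hypothetical host helper: bytes still available for dynamically
// allocated shared memory, given the limit and what is already used.
static size_t sharedBytesLeft(size_t perBlockLimit, size_t staticallyUsed)
{
    return staticallyUsed > perBlockLimit ? 0 : perBlockLimit - staticallyUsed;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scaleKernel) != cudaSuccess) {
        fprintf(stderr, "could not query kernel attributes\n");
        return 1;
    }

    printf("shared memory limit per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("statically used by kernel:     %zu bytes\n", attr.sharedSizeBytes);
    printf("left for dynamic allocation:   %zu bytes\n",
           sharedBytesLeft(prop.sharedMemPerBlock, attr.sharedSizeBytes));
    return 0;
}
```

Whatever comes out on the last line is the most you could pass as the dynamic shared memory size in the launch configuration for that kernel.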