Execution configuration/shared memory

Forgive me if this is a stupid question, but according to the CUDA Programming Guide (http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf), an execution configuration has the syntax “<<< Dg, Db, Ns, S >>>”. The document says:

“Ns is of type size_t and specifies the number of bytes in shared memory that
is dynamically allocated PER BLOCK for this call in addition to the statically
allocated memory; this dynamically allocated memory is used by any of the
variables declared as an external array as mentioned in Section B.2.3; Ns is an
optional argument which defaults to 0;”

I uppercased “PER BLOCK” because I don’t understand what it means. Does it mean that if I call a kernel with Ns, a total of Ns bytes is allocated for all the threads in a block to share, which is what I think the author is trying to say? Or is it Ns * blockDim.x * blockDim.y bytes for all threads to share, which is what “per anything” would normally mean, but which makes no sense here?

Your first reading is correct: Ns is the total number of bytes of dynamic shared memory allocated for each block, shared by all the threads in that block. It is not multiplied by the number of threads.
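For illustration, here is a minimal sketch of how Ns is typically used (the kernel and variable names are my own, not from the guide). The `extern __shared__` array has no compile-time size; its size comes from the Ns launch argument, and every thread in a block sees the same buffer:

```cuda
#include <cstdio>

// Per-block reduction using dynamically sized shared memory.
// The extern array below is sized by the Ns launch argument.
__global__ void sumBlock(const float *in, float *out)
{
    extern __shared__ float buf[];           // Ns bytes, shared by the whole block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction within the block (blockDim.x assumed a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}

int main()
{
    const int threads = 256, blocks = 4;
    float *in, *out;
    cudaMalloc(&in,  blocks * threads * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));

    // Ns is the number of bytes of dynamic shared memory PER BLOCK:
    // one float per thread in the block. It is NOT multiplied again
    // by the number of threads or blocks by the runtime.
    size_t Ns = threads * sizeof(float);
    sumBlock<<<blocks, threads, Ns>>>(in, out);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

So with 4 blocks of 256 threads each, the launch above reserves 1024 bytes of shared memory in each block, for a hardware total of 4 × 1024 bytes across the blocks, but the kernel itself only ever sees its own block’s 1024-byte buffer.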