A question of using shared memory

I studied the matrix-matrix multiplication example, and I understand how shared memory is explicitly declared in the kernel function of the tiled matrix-matrix multiplication implementation.
Now I have some confusion about shared memory usage.

  1. If I do not explicitly declare any shared memory in the kernel function, the .cubin file still shows that a small amount of shared memory is used for each block. Why? And what is that shared memory used for?

  2. The CUDA programming guide says that in <<<Dg, Db, Ns>>>, Ns is the size of dynamically allocated shared memory. Could anybody give an example of a case where we need to dynamically allocate shared memory? And what is the difference between dynamic and static allocation? Also, it says that such dynamic memory is used by any variables declared as an external array… I’m confused about why “external array” — aren’t variables in shared memory supposed to be accessible only to threads within a block?

Sorry if the questions are confusing… any help in understanding this is appreciated. Thanks :)

  1. blockDim, gridDim and kernel arguments are passed into shared memory.

  2. Say you need 1 float of shared memory for each thread in the block, but you call your kernel with different block sizes. Then you need to use Ns to specify the amount of dynamic shared memory to allocate for each kernel launch. The only difference between dynamic and static allocation is whether the amount of shared memory allocated per block is determined by the compiler or by the caller of the kernel.
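A minimal sketch of that situation (the kernel name, data, and sizes here are made up for illustration): the array has no size in the kernel, and the host supplies the size as the third launch parameter, so the same compiled kernel works for any block size.

```cuda
#include <cuda_runtime.h>

// Each thread stages one float in shared memory, but the block size is
// not known at compile time, so the array is declared "extern" with no
// size and sized at launch via Ns.
__global__ void scale(float *data, float factor)
{
    extern __shared__ float buf[];   // size comes from the 3rd launch parameter

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = buf[threadIdx.x] * factor;
}

int main()
{
    int threads = 256;                      // could be any value chosen at run time
    int blocks  = 16;
    size_t ns   = threads * sizeof(float);  // dynamic shared memory per block

    float *d_data;
    cudaMalloc(&d_data, blocks * threads * sizeof(float));
    scale<<<blocks, threads, ns>>>(d_data, 2.0f);  // Ns must match what the kernel uses
    cudaFree(d_data);
    return 0;
}
```

If the kernel had instead declared `__shared__ float buf[256];`, the size would be fixed by the compiler and a launch with a larger block would overrun the array.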

See the code examples for the external array bit. You declare a dynamic shared memory array like this: “extern __shared__ float shared[];” It seems an odd syntax, but it does make sense, as the shared array is technically defined external to the compilation unit.

Thanks a lot! It helps clarify my questions.

I’m not sure which example you are talking about; I cannot find the “external array bit” in the CUDA code samples. Could you kindly give the exact name or a link? Thanks!

I was referring to the CUDA programming guide, section 4.2.2.3, where it has exactly: “extern __shared__ float shared[];”

You can also probably find some examples of this in the SDK samples, though I’m not sure which ones use dynamic shared memory. You can always grep the SDK directory for “extern __shared__”.

Thanks for the reply…

I read in the programming guide, section 4.2.2.2, that for variables declared in shared memory as an external array (“extern __shared__ float shared[]”), the size of the array is determined at launch time.

Here I’m confused about “determined at launch time”: does that mean the shared memory size per block is unknown before the program starts running? Then how does the compiler decide the number of parallel blocks and whether they fit on a streaming multiprocessor (like what the occupancy calculator does)?

Thanks.

-Y

You answered this question yourself in your original post.

Hence the compiler does not determine the amount of extern shared memory; your program does so in software and passes it to the driver when launching the kernel. Only the statically declared shared memory usage is recorded in the .cubin, which is why the compiler alone cannot know the total per-block footprint for a kernel that uses a dynamic array.
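To make that concrete, here is a sketch (kernel and variable names are hypothetical): the same compiled kernel can be launched repeatedly with different amounts of dynamic shared memory, and the driver receives the size with each launch; nothing about it is fixed in the .cubin.

```cuda
#include <cuda_runtime.h>

__global__ void sum_blocks(const float *in, float *out)
{
    extern __shared__ float sdata[];   // size unknown to the compiler

    sdata[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // Simple tree reduction over the block (assumes blockDim.x is a power of 2).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];
}

void run(const float *d_in, float *d_out, int blocks, int threads)
{
    // Per-block shared memory size decided here, at launch time, by the host.
    // A different call to run() with a different 'threads' reuses the same
    // compiled kernel with a different Ns.
    size_t ns = threads * sizeof(float);
    sum_blocks<<<blocks, threads, ns>>>(d_in, d_out);
}
```

Occupancy is then checked per launch: the hardware limit applies to static plus dynamic shared memory, so a larger Ns can reduce the number of blocks that fit on a multiprocessor, or make the launch fail if the total exceeds the per-block limit.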