I studied the matrix-matrix multiplication example and understand how shared memory is declared explicitly in the kernel function of the tiled implementation.
However, I am still confused about some aspects of shared memory usage.
Even if I do not declare any shared memory in the kernel function, the .cubin file shows that a small amount of shared memory is still used per block. Why is that, and what is that shared memory used for?
The CUDA Programming Guide says that in <<<Dg, Db, Ns>>>, Ns is the size of the dynamically allocated shared memory. Could anybody give an example of a case where shared memory needs to be allocated dynamically? What is the difference between dynamic and static allocation? The guide also says that this dynamic memory is used by any variable declared as an external array. Why "external array"? Aren't variables in shared memory supposed to be accessible only to the threads within a block?
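To make the question concrete, here is a minimal sketch of the two declaration styles as I understand them from the guide. The kernel names (reverseStatic, reverseDynamic), the helper runAndCountErrors, and the size N are made up for illustration and are not from the matrix multiplication example. The only intended difference between the two kernels is where the array size comes from: a compile-time constant in the static case, and the third launch parameter Ns in the dynamic case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical element count, also used as the block size

// Static shared memory: the array size is a compile-time constant.
__global__ void reverseStatic(int *d, int n)
{
    __shared__ int s[N];
    int t = threadIdx.x;
    if (t < n) s[t] = d[t];
    __syncthreads();                 // all threads of the block reach this point
    if (t < n) d[t] = s[n - 1 - t];
}

// Dynamic shared memory: the array is declared extern with no size.
// Its size in bytes is supplied at launch time as Ns in <<<Dg, Db, Ns>>>.
__global__ void reverseDynamic(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    if (t < n) s[t] = d[t];
    __syncthreads();
    if (t < n) d[t] = s[n - 1 - t];
}

// Hypothetical helper: reverses 0..N-1 on the device with one block
// and returns the number of elements that ended up in the wrong place.
static int runAndCountErrors(bool dynamic)
{
    int h[N];
    int *d = nullptr;
    for (int i = 0; i < N; ++i) h[i] = i;

    cudaMalloc((void **)&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    if (dynamic)
        reverseDynamic<<<1, N, N * sizeof(int)>>>(d, N);  // Ns = N * sizeof(int) bytes
    else
        reverseStatic<<<1, N>>>(d, N);                    // Ns omitted, defaults to 0

    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) ++errors;
    return errors;
}

int main()
{
    printf("static  version: %d mismatches\n", runAndCountErrors(false));
    printf("dynamic version: %d mismatches\n", runAndCountErrors(true));
    return 0;
}
```

In the dynamic version the kernel itself does not know how large s is, so the same kernel could in principle be launched with a different block size and a matching Ns without recompiling. That seems to be the use case for dynamic allocation, but I am not sure it is the whole story.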
Sorry if my questions are confusing. Any help in understanding this is appreciated. Thanks :)