Hi community,
When I tried to print my device's maximum available shared memory, I found two properties in CUDA: deviceProp.sharedMemPerBlock and deviceProp.sharedMemPerMultiprocessor.
On an A800 GPU, deviceProp.sharedMemPerBlock is 49152 bytes and deviceProp.sharedMemPerMultiprocessor is 167936 bytes.
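For reference, this is a minimal sketch of how I printed those values (I also print sharedMemPerBlockOptin, a related field I noticed in cudaDeviceProp, in case it matters here):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // Per-block default limit vs. total per-SM capacity
    printf("sharedMemPerBlock:          %zu bytes\n", prop.sharedMemPerBlock);
    printf("sharedMemPerMultiprocessor: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    // Opt-in per-block maximum (may be larger than the default limit)
    printf("sharedMemPerBlockOptin:     %zu bytes\n", prop.sharedMemPerBlockOptin);
    return 0;
}
```

On my A800 the first two lines print 49152 and 167936 as described above.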
I know that an SM can run multiple blocks. My question is: even if a kernel has only one block, is the maximum shared memory available to that block still 49152 bytes?
For example, on the A800 I have a kernel whose number of blocks equals the number of SMs, i.e. 108.
In this case each SM runs exactly one block, yet although each SM has 167936 bytes of shared memory, each block can apparently use at most 49152 bytes.
Is that true? Or will CUDA do some optimization when each SM runs only one block, so that the maximum shared memory available to each block can be greater than 49152 bytes?