The default shared memory size per block is 48KB (49,152 bytes), which is what you are specifying as the dynamic shared memory size in the kernel launch configuration. But when the 256 bytes per block from the additional static shared memory array are added on top, the total (49,152 + 256 = 49,408 bytes) goes over the max, and thus you get the launch error.
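Here's a minimal CUDA Fortran sketch of that situation — the kernel and variable names are hypothetical; only the sizes (5600 blocks of 256 threads, a 32-element real(8) shared array, 48KB of dynamic shared memory) come from your post:

```fortran
module mod_kern
  use cudafor
contains
  ! Hypothetical kernel: a 32-element real(8) shared array,
  ! i.e. 32 * 8 = 256 bytes of STATIC shared memory per block
  attributes(global) subroutine kern(a, n)
    integer, value :: n
    real(8) :: a(n)
    real(8), shared :: s(32)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (threadIdx%x <= 32) s(threadIdx%x) = 1.0d0
    call syncthreads()
    if (i <= n) a(i) = s(1)
  end subroutine kern
end module mod_kern

program demo
  use cudafor
  use mod_kern
  integer, parameter :: n = 5600 * 256
  real(8), device :: a_d(n)
  ! Requesting the full 48KB (49,152 bytes) as DYNAMIC shared memory
  ! leaves no room for the 256 bytes of static shared memory above:
  ! 49,152 + 256 = 49,408 > 49,152, so the launch fails
  call kern<<<5600, 256, 49152>>>(a_d, n)
  print *, cudaGetErrorString(cudaGetLastError())
end program demo
```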
Only one kernel can be launched on each SM at a time. Since I allocated 48KB of shared memory for each SM, I thought that 0.25KB (32 * 8 / 1024) per kernel would be enough to fit in the spare shared memory.
You do not need to set the dynamic shared memory size if your kernel only uses static, fixed-size shared memory arrays. You only need dynamic shared memory when using dynamically sized shared arrays, i.e. automatic or assumed-size arrays.
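So, using the hypothetical kernel from the sketch above, which only has the fixed-size array s(32), the fix is simply to drop the dynamic size from the launch configuration:

```fortran
! Static shared memory only: no third launch-configuration
! argument (dynamic shared memory bytes) is needed
call kern<<<5600, 256>>>(a_d, n)
```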
The total shared memory usage is the sum of the dynamic shared memory (set at runtime via the CUDA launch configuration) and the static shared memory (set at compile time via fixed-size shared arrays).
What I understood is that each SM has 64KB that can be split between shared memory and L1 cache.
The default shared memory size is 48KB. To get to 64KB, you need to call cudaFuncSetAttribute to increase the max amount of dynamic shared memory, but the exact max varies depending on the device, so it may not give you 64KB on all devices.
Of course, your original example doesn't actually use dynamic shared memory, so this extra memory goes to waste. Again, if you want to use dynamic shared memory, the shared array in the kernel needs to be automatic or declared as assumed-size.
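For completeness, here's a sketch of the dynamic variant (again with hypothetical names, and assuming the cudafor interface to cudaFuncSetAttribute): the shared array is declared assumed-size, the per-kernel limit is raised above the 48KB default, and the byte count goes in the launch configuration:

```fortran
module mod_kern_dyn
  use cudafor
contains
  ! Hypothetical kernel: the assumed-size shared array maps onto the
  ! dynamic shared memory bytes given in the launch configuration
  attributes(global) subroutine kern_dyn(a, n)
    integer, value :: n
    real(8) :: a(n)
    real(8), shared :: s(*)      ! assumed-size -> DYNAMIC shared memory
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    s(threadIdx%x) = 1.0d0
    call syncthreads()
    if (i <= n) a(i) = s(threadIdx%x)
  end subroutine kern_dyn
end module mod_kern_dyn

program demo_dyn
  use cudafor
  use mod_kern_dyn
  integer, parameter :: n = 5600 * 256
  real(8), device :: a_d(n)
  integer :: istat
  ! Opt this kernel in to more than the 48KB default of dynamic shared
  ! memory; the achievable maximum is device-dependent
  istat = cudaFuncSetAttribute(kern_dyn, &
            cudaFuncAttributeMaxDynamicSharedMemorySize, 65536)
  ! Third launch-configuration argument = bytes of dynamic shared memory
  call kern_dyn<<<5600, 256, 65536>>>(a_d, n)
  istat = cudaDeviceSynchronize()
end program demo_dyn
```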
I have 16 SMs and 48KB of shared memory for each. Assuming the SMs end up running the same number of kernels, that's 5600/16 = 350 each; is it going to try to allocate 350 * 0.25KB = 87.5KB of shared memory, instead of writing over the same 0.25KB, when I launch it statically?
Each SM has its own shared memory, which is partitioned amongst the blocks actively running concurrently on the SM. This memory is re-partitioned and reused for subsequent blocks scheduled after the previous blocks complete.
Keep in mind that an SM can run up to 2048 concurrent threads, or a max of 32 blocks (at 64 threads per block). But if one block uses all the available shared memory, only that single block can be running on the SM, which limits occupancy. Given your block has 256 threads, you're getting at best 12.5% theoretical occupancy, since only 256 of the 2048 possible concurrent threads (1/8th) can actually run.
Not counting register usage, which also has an impact: to get to 100% occupancy, you can have a max of 6KB of shared memory per block with a block size of 256 (8 blocks * 6KB = 48KB), or 24KB with a block size of 1024 (2 blocks * 24KB = 48KB).
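If you want to check this programmatically rather than by hand, one option is the occupancy API — a sketch assuming the cudafor interface to cudaOccupancyMaxActiveBlocksPerMultiprocessor, reusing the hypothetical kern from above:

```fortran
program occupancy_check
  use cudafor
  use mod_kern
  integer :: istat, numblocks
  ! How many 256-thread blocks of kern (with 0 bytes of dynamic shared
  ! memory) can be resident on one SM, given its static shared memory
  ! and register usage
  istat = cudaOccupancyMaxActiveBlocksPerMultiprocessor(numblocks, kern, 256, 0)
  ! Theoretical occupancy = resident threads / max threads per SM
  ! (2048 here, as discussed above; this is architecture-dependent)
  print *, 'blocks per SM:', numblocks, &
           ' occupancy:', real(numblocks * 256) / 2048.0
end program occupancy_check
```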
Note that the CUDA occupancy calculator is now part of Nsight Compute and might help in understanding how shared memory and register usage impact the theoretical occupancy.