Suppose I intend to run the total of 1,000,000 threads. I have 1D blocks, blockDim = 512. This will yield gridDim = 1954.
Now, each block allocates 1K of shared memory. I have 14 SMs. Since the limit for my GPU is 48K per SM, with this many blocks, it is conceivable that 139 may be allocated per SM, meaning that I may end up violating the max shared memory limitation.
Is this conclusion correct? Or are there other factors I am not taking into account?
There are other factors you’re not taking into account.
Each SM has a limit on the number of threadblocks (lower than 139) that can be resident as well as a limit on the number of threads that can be resident (usually 1536 or 2048, depending on GPU). So the thread limitation would prevent more than 3-4 of your 512-thread threadblocks from being resident on an SM at any given time. New threadblocks would not become resident until previous ones had finished, and released their shared memory allocation. So given your scenario it appears that no more than 4K out of 48K of shared memory would be in use on any SM at any given time.
Furthermore, threadblocks are not issued to SMs until there are sufficient resources of all types necessary to support that threadblock. So even if your threadblocks were using, say, 32KB of shared memory, that just means that shared memory would become the limiting factor, and no more than 1-3 threadblocks would be resident on an SM, at any given time, due to shared memory resource limits.
Ok. This clarifies it. Thank you very much.