There are other factors you’re not taking into account.
Each SM has a limit on the number of threadblocks that can be resident (lower than 139), as well as a limit on the number of threads that can be resident (usually 1536 or 2048, depending on the GPU). The thread limit alone would prevent more than 3-4 of your 512-thread threadblocks from being resident on an SM at any given time (1536/512 = 3, 2048/512 = 4). New threadblocks would not become resident until previous ones had finished and released their shared memory allocation. So in your scenario it appears that no more than 4K of the 48K of shared memory would be in use on any SM at any given time.
Furthermore, a threadblock is not issued to an SM until that SM has sufficient resources of every type needed to support it. So even if your threadblocks were using, say, 32KB of shared memory each, that just means that shared memory would become the limiting factor: no more than 1-3 threadblocks would be resident on an SM at any given time, depending on how much shared memory the SM has (only 1 if the SM has 48KB).
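To make the arithmetic concrete, here is a rough host-side sketch of the "most restrictive resource wins" reasoning above. The `SmLimits` struct, the `resident_blocks` function and the limit values in the comments are illustrative assumptions, not figures for your GPU, and the sketch ignores registers and shared memory allocation granularity, which can lower the real number further.

```cpp
#include <algorithm>

// Hypothetical per-SM limits. Real values depend on the GPU and should be
// taken from the device properties or the programming guide.
struct SmLimits {
    int max_threads;   // resident-thread limit, e.g. 1536 or 2048
    int max_blocks;    // resident-threadblock limit (assumed example: 32)
    int shared_bytes;  // shared memory per SM, e.g. 48 * 1024
};

// Estimate how many threadblocks can be resident on one SM at once.
// Each resource gives its own upper bound; the smallest bound wins.
constexpr int resident_blocks(const SmLimits& sm,
                              int threads_per_block,
                              int shared_bytes_per_block) {
    const int by_threads = sm.max_threads / threads_per_block;
    // A block that uses no shared memory is not limited by it.
    const int by_shared = shared_bytes_per_block > 0
                              ? sm.shared_bytes / shared_bytes_per_block
                              : sm.max_blocks;
    return std::min({sm.max_blocks, by_threads, by_shared});
}

// Example with the assumed limits {2048, 32, 48 * 1024}:
//   resident_blocks(sm, 512, 1024)      -> limited by threads
//   resident_blocks(sm, 512, 32 * 1024) -> limited by shared memory
```

On a real device, the runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports this number for a specific kernel and accounts for register usage as well, so it is the better tool when you need an exact answer.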