CUDA GPU SMs can have more than one threadblock resident at a time. The maximum number of threads per SM is a limit across all resident threadblocks. If that limit is 2048, then, for example, two threadblocks of 1024 threads each could be resident simultaneously.
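A quick way to confirm both limits on a given device is to query them at runtime. Below is a minimal sketch using the CUDA runtime API (device 0 is assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Per-block limit (1024 on cc2.0 and newer) vs. per-SM limit,
    // which is spread across all blocks resident on that SM.
    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    return 0;
}
```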
The CUDA threadblock limit of 1024 threads has been a standard part of CUDA for at least the last 10 years; every device from compute capability 2.0 through the current generation has this same limit. I wouldn't be able to offer any further explanation than that.
(cc1.x devices had a limit of 512 threads per block.)
It could be related to block-wide synchronization for certain instructions (needing more physical data lines, shadow registers, bits, …). It could also be an artificial limit that keeps kernels compatible across architectures.
The maximum number of threads per block is an architectural decision based on area cost and compatibility. Increasing the limit increases wiring and therefore area cost; this is not something software developers generally consider, but it is critical in efficient hardware design. Raising the maximum on some architectures (e.g. the 100 class) would mean that either lower-end parts also have to increase the value, or there would be compatibility issues. Graphics/mobile-focused GPUs tend to have fewer threads per SM to reduce cost while still providing adequate resources for the typical use case. One of the benefits of CUDA is the ability to write a compatible program that can scale from 1-2 SM GPUs all the way to 100+ SM GPUs.
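That kind of scaling is commonly written as a grid-stride loop: the kernel is correct for any grid size, and the launch is simply sized to whatever device it runs on. A minimal sketch (the kernel name, the blocks-per-SM heuristic, and the sizes here are illustrative, not taken from the posts above):

```cpp
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th
// element, so correctness doesn't depend on how many blocks or SMs exist.
__global__ void scale(float *data, float factor, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Size the grid to the device: a few blocks per SM is a common
    // heuristic, not a rule. A 2-SM GPU and a 100+-SM GPU both work.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount * 4;

    scale<<<blocks, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```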