Physical understanding of the structure of SMs

I have trouble clearly understanding the physical structure of a GPU, and I would like to be able to picture it. I can understand how a CPU is organized, but not an NVIDIA GPU.

I have a problem relating the hardware view to the software view.

As I understand it, we can define how many threads we want for each kernel. So can we leave an SM without any threads and move its threads to another SM? (I think this is wrong, but at the same time the software behavior seems different.)

Here might be a good place to start: Programming Guide :: CUDA Toolkit Documentation
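
To connect the two views in the meantime, a small experiment may help. The sketch below is not taken from the Programming Guide; it is a minimal example, assuming device 0 is a CUDA-capable GPU. The helper `current_sm_id`, the kernel `record_sm`, and the launch sizes (8 blocks of 128 threads) are made up for illustration. It prints how many SMs the device has, then records which SM each thread block ran on by reading the PTX special register `%smid`. In software you choose only the number of blocks and the threads per block; the hardware decides which SM each block lands on, and a block stays on that SM until it finishes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: read %smid, the ID of the SM running this thread.
// This register is intended for diagnostics, not for program logic.
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records the SM that the block was placed on.
__global__ void record_sm(unsigned int *sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = current_sm_id();
    }
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("%s: %d SMs, up to %d resident threads per SM\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor);

    // Illustrative launch configuration chosen by the programmer.
    const int num_blocks = 8;
    const int threads_per_block = 128;

    unsigned int *d_sm = nullptr;
    CHECK(cudaMalloc(&d_sm, num_blocks * sizeof(unsigned int)));

    record_sm<<<num_blocks, threads_per_block>>>(d_sm);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    unsigned int h_sm[num_blocks];
    CHECK(cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost));

    for (int b = 0; b < num_blocks; ++b) {
        printf("block %d ran on SM %u\n", b, h_sm[b]);
    }

    CHECK(cudaFree(d_sm));
    return 0;
}
```

Compiled with `nvcc`, this should show the blocks spread over the SMs in some order that depends on the GPU and may change between runs. If you launch fewer blocks than there are SMs, some SMs simply get no work for that kernel. There is no API to pin a block to a chosen SM or to move a running block to another one.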