Warp Size Question

At this point I am utterly and completely confused. I know the CUDA documentation says in many places that we can have many blocks per SM. But how can we have multiple “active” blocks per multiprocessor? I thought one SM executes one block, in groups of 32 threads at a time, and once it is done, it moves on to another block.
Also, I am still confused about the 32 threads (the warp size) executing at a time. With 8 cores, each core can execute only 1 thread at a given time, right? Or is it 1 instruction from each thread at a given time? Does that mean there is some kind of scheduling going on even at the processor-core level?
It would be nice if someone could walk through the whole process: how blocks are allocated to an SM, then how threads are mapped onto cores, and what exactly gets executed “at a time” and “at a given (instant of) time” (maybe with an example).

Simple - each SM has dedicated hardware for managing the state of, and dispatching instructions from, a number of warps simultaneously (how many depends on the hardware generation). A block must be tied to a given SM, but the warps resident on an SM can come from different blocks (ie you can have more than 1 block per SM). The grid-level scheduler assigns as many blocks to a given SM as its resources (registers, shared memory, warp slots) allow, and the SM runs those blocks until they are all finished, then it is sent more. The SM-level hardware selects, schedules and runs warps which have instructions and data available; those which don’t are suspended until they do.
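As a concrete way to see the “more than 1 block per SM” point, here is a minimal sketch using the runtime occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor was added in CUDA 6.5, so it postdates the compute 1.x parts discussed in this thread; the dummy kernel and the block size of 128 are just illustrative assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel whose register/shared-memory footprint the
// occupancy calculator will inspect.
__global__ void dummy(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main(void)
{
    int blocksPerSM = 0;
    // Ask the runtime how many 128-thread blocks of this kernel can
    // be resident on one SM simultaneously (0 bytes of dynamic
    // shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, 128, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d resident blocks per SM on %s (%d SMs)\n",
           blocksPerSM, prop.name, prop.multiProcessorCount);
    return 0;
}
```

Any value greater than 1 printed there is exactly the “multiple active blocks per multiprocessor” situation the question asks about.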

No - the SM does not finish one block and then move on to the next. It is only by having a lot of “active” warps available to switch between that all the latency in the architecture can be hidden.
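To put rough numbers on the latency hiding (figures from the compute 1.x era programming guide, so treat them as approximate): a register read-after-write dependency stalls a warp for about 24 clock cycles, and a warp instruction issues over 4 cycles, so an SM needs around 24 / 4 = 6 active warps just to cover arithmetic latency, and considerably more to cover the 400-600 cycles of a global memory access. That is why those devices allow up to 24 resident warps per SM (compute 1.0/1.1) or 32 (compute 1.2/1.3).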

One instruction from 1 thread per clock cycle per core - so with 8 cores it takes a minimum of 4 clock cycles to retire one instruction for a whole warp of 32 threads on a compute 1.0/1.1/1.2/1.3 device. If it makes it easier, think of each core running the same instruction 4 times in a row, each time for a different thread.
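If it helps to see the warp decomposition in code, here is a minimal sketch (device-side printf needs a compute 2.0 or newer part, so this would not actually run on the 1.x hardware discussed above; warpSize is a built-in variable and is 32):

```cuda
#include <cstdio>

// Each thread reports which warp and which lane within that warp it
// belongs to. All 32 lanes of a warp execute the same instruction;
// on an 8-core compute 1.x SM the hardware spreads those 32 threads
// over the cores across 4 clock cycles.
__global__ void whoami(void)
{
    int warp = threadIdx.x / warpSize;  // warp index within the block
    int lane = threadIdx.x % warpSize;  // lane index within the warp
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warp, lane);
}

int main(void)
{
    whoami<<<2, 64>>>();        // 2 blocks of 64 threads = 2 warps each
    cudaDeviceSynchronize();    // wait for the device printf output
    return 0;
}
```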