Threads per warp vs number of cores

Apologies for newbie question #1.
My understanding from the CUDA programming guide is that
* there are 8 scalar (SP) cores per multiprocessor (e.g., in a Quadro FX 4600)
* each thread is mapped to a scalar core
* scheduling and execution are performed in groups of 32 threads (one warp) at a time

Does this mean that 32 threads are scheduled at a time but, since there are only 8 scalar cores per multiprocessor, only 8 threads actually have instructions issued at once? I.e., within a block, at each time slice 8 cores receive a new issue, and this is repeated three more times until all 32 threads in the current warp have issued?

The 32 threads are pipelined into the 8 SPs so that an entire warp can complete an instruction every 4 clock cycles (32 threads / 8 SPs). (This is a good trick for reducing the chance of pipeline hazards, since all the threads in a warp are independent by construction.) The instruction executed by the warp is decoded only once, and unused threads in the warp appear as bubbles in the pipeline. A warp can have disabled threads either because the kernel was launched with a block size that is not a multiple of 32 or because a branch diverged between threads in the warp.
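
To put numbers on this, here is a rough back-of-the-envelope sketch in Python (host-side arithmetic only, not CUDA code). It assumes the figures from this thread, 32 threads per warp and 8 SPs per multiprocessor. The function names are made up for illustration, and the model deliberately ignores memory latency, branch divergence, and interleaving between warps.

```python
WARP_SIZE = 32   # threads per warp, as stated in the programming guide
SPS_PER_SM = 8   # scalar cores per multiprocessor (figure from this thread)


def cycles_per_warp_instruction(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """Clock cycles to pipeline one instruction for a full warp through the SPs."""
    return warp_size // sps


def warps_in_block(block_size, warp_size=WARP_SIZE):
    """Warps occupied by a block; a partially filled warp still counts as one."""
    return (block_size + warp_size - 1) // warp_size


def idle_lanes(block_size, warp_size=WARP_SIZE):
    """Disabled threads (pipeline bubbles) in the block's last, partial warp."""
    return warps_in_block(block_size, warp_size) * warp_size - block_size


if __name__ == "__main__":
    # Columns: block size, warps, idle lanes, issue cycles for one instruction
    for block_size in (32, 100, 256):
        warps = warps_in_block(block_size)
        cycles = warps * cycles_per_warp_instruction()
        print(block_size, warps, idle_lanes(block_size), cycles)
```

Under these assumptions, a block of 100 threads occupies 4 warps, and the last warp carries 28 disabled threads. It therefore costs the same issue slots per instruction as a block of 128, which is one reason to pick block sizes that are multiples of 32.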

Excellent explanation. Thanks.