A streaming multiprocessor (SM) executes blocks. The scheduler dispatches blocks to SMs with free resources. Once a block starts on an SM, it must run to completion there: blocks cannot be suspended and cannot be migrated to another SM. If the block's resource usage (shared memory, registers) allows, multiple blocks can be resident on the same SM at once.
Inside an SM, there is another scheduler that issues warp-level instructions. Each block is composed of one or more warps. When a warp is ready to execute (i.e., not waiting on a memory read or a synchronization barrier), the scheduler can issue that warp's next instruction to some number of streaming processors (SPs, which NVIDIA now calls "CUDA cores"). On compute capability 1.x, a warp was processed by all 8 SPs over four clock cycles. On compute capability 2.0, the scheduler issues the next instruction for two different warps each clock, and each warp is processed by 16 SPs (32 total on the SM). Compute capability 2.1 does the same as 2.0, but can additionally issue a second, independent instruction from one of those warps to another 16 SPs (48 total on the SM).
The number of SPs in an SM determines the maximum possible instruction throughput for that SM. However, there is no one-to-one mapping between a thread and an SP. SPs are simply computation engines that process whatever instructions are pushed into their pipelines, and those instructions generally come from many different threads. More SPs let the SM process more thread instructions at the same time, but all of those instructions come from blocks that have been assigned to that SM.