Streaming multiprocessors and processing blocks

The warp is split across 2 cycles, 16 threads at a time. Each of the “4 processing blocks with 16 cores each” is referred to as an SMSP (SM Sub-Partition). Although it answers a question about instruction latency, Greg’s answer here may clarify things. His “EXAMPLE 1: 1 Warp per SM Sub-partition” shows the ALU active for two consecutive cycles, processing all 32 threads.
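To make the arithmetic concrete, here is a minimal sketch of how a 32-thread warp maps onto a 16-lane processing block. It is only a toy model: the function names `issue_cycles` and `issue_schedule` are made up for illustration, and the assumption that threads 0-15 go in the first cycle and 16-31 in the second is mine, not something the hardware documentation guarantees.

```python
import math

WARP_SIZE = 32  # threads per warp


def issue_cycles(lanes_per_block, warp_size=WARP_SIZE):
    """Cycles needed to pass one warp through a unit with the given lane count."""
    return math.ceil(warp_size / lanes_per_block)


def issue_schedule(lanes_per_block, warp_size=WARP_SIZE):
    """Thread indices handled in each cycle.

    Assumes the warp is processed in consecutive, in-order chunks
    (an illustrative assumption, not a documented hardware detail).
    """
    return [
        list(range(start, min(start + lanes_per_block, warp_size)))
        for start in range(0, warp_size, lanes_per_block)
    ]


if __name__ == "__main__":
    print("cycles for 16 lanes:", issue_cycles(16))
    for cycle, threads in enumerate(issue_schedule(16)):
        print(f"cycle {cycle}: threads {threads[0]}-{threads[-1]}")
```

With 16 lanes this gives 2 cycles per warp instruction, which matches the two consecutive ALU-active cycles in Greg’s example. A unit with 32 lanes would need only 1 cycle under the same model.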

