Streaming multiprocessors and processing blocks

Hello. I am reading Hwu, Kirk and Hajj’s “Programming Massively Parallel Processors: A Hands-on Approach”, 4th edition. I am confused about the organization of streaming multiprocessors into processing blocks as described in the chapter on compute architecture and scheduling. The authors give the example of the Ampere A100 SM, which has 4 processing blocks with 16 cores each. They state that threads in the same warp are assigned to the same processing block, which fetches instructions and executes them for all threads in the warp at the same time. I was under the assumption that one core could execute only a single thread at any one time. But with 32 threads in a warp, how can the 16 cores in an Ampere A100 SM processing block execute 32 threads at the same time? If anyone could help clarify this, I would be grateful.

The warp is split across 2 cycles, 16 threads at a time. Each of the “4 processing blocks with 16 cores each” is referred to as an SMSP (SM Sub-Partition). Although it answers a question about instruction latency, Greg’s answer here may clarify things. His “EXAMPLE 1: 1 Warp per SM Sub-partition” shows the ALU active for two consecutive cycles processing all 32 threads.
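
To make the arithmetic concrete, here is a rough sketch of the 32-threads-over-16-lanes split. It is only a toy model, not how the hardware is actually implemented: the function names `issue_cycles` and `lane_schedule` are made up for illustration, the default of 16 lanes is taken from the A100 figure quoted in the question, and pipeline latency, other execution units, and scheduling between warps are all ignored.

```python
def issue_cycles(warp_size: int = 32, lanes: int = 16) -> int:
    """Number of consecutive cycles needed to push one warp-wide
    instruction through `lanes` execution lanes (ceiling division)."""
    return -(-warp_size // lanes)


def lane_schedule(warp_size: int = 32, lanes: int = 16) -> list[list[int]]:
    """Thread indices handled in each cycle, assuming (hypothetically)
    that threads are taken in order, `lanes` at a time."""
    return [
        list(range(start, min(start + lanes, warp_size)))
        for start in range(0, warp_size, lanes)
    ]


if __name__ == "__main__":
    print("cycles per warp instruction:", issue_cycles())
    for cycle, threads in enumerate(lane_schedule()):
        print(f"cycle {cycle}: threads {threads[0]}..{threads[-1]}")
```

With the defaults this gives 2 cycles, the first covering threads 0..15 and the second threads 16..31. From the programmer's point of view it is still one instruction for the whole warp; it just occupies the 16 lanes for two cycles.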

Thanks a lot! I get the concept now having read Greg’s answer, and thanks for making me aware of the SM sub-partition term.