Streaming multiprocessors and processing blocks

oglr · January 7, 2024, 9:23am

Hello. I am reading Hwu, Kirk and Hajj’s “Programming Massively Parallel Processors: A Hands-on Approach”, 4th edition. I am confused about how streaming multiprocessors (SMs) are organized into processing blocks, as described in the chapter on compute architecture and scheduling. The authors give the example of the Ampere A100 SM, which has 4 processing blocks with 16 cores each. They state that threads in the same warp are assigned to the same processing block, which fetches instructions and executes them for all threads in the warp at the same time. I had assumed that one core can execute only a single thread at any one time. But with 32 threads in a warp, how can the 16 cores in an A100 processing block execute 32 threads at the same time? If anyone could help clarify this, I would be grateful.

rs277 · January 7, 2024, 7:51pm

The warp is split across 2 cycles, 16 threads at a time. Each of the “4 processing blocks with 16 cores each” is referred to as an SMSP (SM Sub-Partition). Although it answers a question about instruction latency, Greg’s answer here may clarify things. His “EXAMPLE 1: 1 Warp per SM Sub-partition” shows the ALU active for two consecutive cycles, processing all 32 threads.
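
To make the arithmetic concrete, here is a rough Python sketch of the idea. It is a toy model, not a description of the actual hardware: the function names `issue_cycles` and `lane_schedule` are made up for illustration, and the assumption that threads 0-15 go in the first cycle and threads 16-31 in the second is mine. It also ignores pipeline latency and everything else the scheduler does.

```python
import math

WARP_SIZE = 32  # threads per warp


def issue_cycles(lanes, warp_size=WARP_SIZE):
    """Cycles needed to feed one warp-wide instruction through `lanes` cores.

    Toy model: each core handles one thread per cycle, nothing else is counted.
    """
    return math.ceil(warp_size / lanes)


def lane_schedule(lanes, warp_size=WARP_SIZE):
    """Thread indices handled in each cycle.

    The in-order split (0-15, then 16-31) is an assumption for illustration.
    """
    return [
        list(range(start, min(start + lanes, warp_size)))
        for start in range(0, warp_size, lanes)
    ]


if __name__ == "__main__":
    lanes = 16  # cores per processing block in the A100 example from the book
    print("cycles per warp instruction:", issue_cycles(lanes))
    for cycle, threads in enumerate(lane_schedule(lanes)):
        print("cycle", cycle, "-> threads", threads[0], "to", threads[-1])
```

With 16 lanes the model gives 2 cycles per warp instruction, matching the two consecutive ALU-active cycles in Greg's example. With 32 lanes it would give 1.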

oglr · January 8, 2024, 8:30am

Thanks a lot! I get the concept now having read Greg’s answer, and thanks for making me aware of the SM sub-partition term.
