Newbie confusion: thread, block, multiprocessor and processor

xinwu · April 12, 2011, 2:20pm

Hi, I’m a newbie.

I know one block consists of several threads, and one streaming multiprocessor (SM) consists of (usually) 8 streaming processors (SPs). But I’m confused about how threads and blocks relate to SMs and SPs.

I can think of two possible interpretations:

(a) one block resides in one SM, and the 8 SPs run all the threads in the block.

(b) one block resides in one SP, and this SP runs all the threads.

I don’t know which is correct.

Moreover, Figure 1-4 in the “NVIDIA CUDA C Programming Guide Version 3.2” (sorry, I don’t know how to insert the figure in this post) illustrates that several blocks can be processed by several cores simultaneously. What do “core” and “block” mean there?

Thank you in advance!

seibert · April 12, 2011, 5:29pm

A streaming multiprocessor executes blocks. The scheduler sends blocks to available SMs for processing. Once a block starts on an SM, it must run to completion on that SM. Blocks cannot be suspended and cannot be migrated to other SMs. If resource usage (like shared memory and registers) allows, multiple blocks can be sent to the same SM for execution.
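To make the resource point concrete, here is a minimal C++ sketch of how shared memory, registers, and thread count could bound the number of blocks resident on one SM. The struct names, fields, and the simple "tightest limit wins" model are assumptions for illustration only; real hardware also applies allocation granularity, so treat the result as a rough estimate and not an exact occupancy figure.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits; real values depend on the compute capability.
struct SmLimits {
    int max_blocks;        // cap on resident blocks
    int max_threads;       // cap on resident threads
    int registers;         // registers available on the SM
    int shared_mem_bytes;  // shared memory available on the SM
};

// Hypothetical per-block resource usage of a kernel launch.
struct BlockUsage {
    int threads;           // threads per block
    int regs_per_thread;   // registers used by each thread
    int shared_mem_bytes;  // shared memory used by each block
};

// Simplified estimate: each resource limits the block count independently,
// and the tightest limit wins. Allocation granularity is ignored.
inline int resident_blocks_per_sm(const SmLimits& sm, const BlockUsage& b) {
    assert(b.threads > 0 && b.regs_per_thread > 0);
    int n = sm.max_blocks;
    n = std::min(n, sm.max_threads / b.threads);
    n = std::min(n, sm.registers / (b.threads * b.regs_per_thread));
    if (b.shared_mem_bytes > 0) {
        n = std::min(n, sm.shared_mem_bytes / b.shared_mem_bytes);
    }
    return n;
}
```

With illustrative limits of 8 blocks, 768 threads, 8192 registers, and 16384 bytes of shared memory, a 256-thread block using 10 registers per thread and 4096 bytes of shared memory gives an estimate of 3 resident blocks. Raising register use to 16 per thread drops it to 2.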

Inside an SM, there is another scheduler that issues warp-level instructions. Each block is composed of one or more warps. When a warp is available for execution (i.e., not waiting for memory reads or a synchronization barrier), the scheduler can issue the next instruction for that warp to some number of streaming processors (SPs, which NVIDIA now calls “CUDA cores”). On compute capability 1.x, a warp is processed by all 8 SPs. On compute capability 2.0, the scheduler issues the next instruction for two different warps every clock, and each warp is processed by 16 SPs (32 total on the SM). Compute capability 2.1 does the same as 2.0, but can issue one additional instruction from a warp to another 16 SPs (48 total on the SM).
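The warp arithmetic behind this can be sketched in a few lines of C++. The function names are made up for illustration, and the clock count assumes an idealized model in which each SP accepts one thread's instruction per clock, so it is a back-of-the-envelope figure and not a timing guarantee.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp

// Number of warps needed to hold a block; a partially filled warp
// still occupies a whole warp.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Clocks for a group of SPs to work through one warp-wide instruction,
// assuming each SP accepts one thread's instruction per clock.
inline int clocks_per_warp_instruction(int sps_per_warp) {
    assert(sps_per_warp > 0 && kWarpSize % sps_per_warp == 0);
    return kWarpSize / sps_per_warp;
}
```

Under these assumptions, a 100-thread block occupies 4 warps, a warp instruction spread over 8 SPs takes 4 clocks, and one spread over 16 SPs takes 2 clocks.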

The number of SPs in an SM determines the maximum possible instruction throughput for the SM. However, there is not a one-to-one mapping between a thread and an SP. SPs are just computation engines that process whatever instructions are pushed into their pipelines, and those instructions will in general come from many different threads. If you have more SPs, you can process more thread instructions at the same time, but all of those instructions will come from blocks that have been assigned to the parent SM.
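A rough way to see why threads and SPs are not paired one-to-one is to count how many resident threads share each SP. The sketch below is an idealized model with hypothetical function names: it assumes every SP accepts one thread instruction per clock and that nothing ever stalls, which real kernels will not achieve.

```cpp
#include <cassert>

constexpr int kThreadsPerWarp = 32;

// Idealized clocks for one SM to execute one instruction from every
// resident thread, with no stalls and one thread instruction per SP per clock.
inline int clocks_for_one_pass(int resident_threads, int sps_per_sm) {
    assert(resident_threads > 0 && sps_per_sm > 0);
    int warps = (resident_threads + kThreadsPerWarp - 1) / kThreadsPerWarp;
    int thread_instructions = warps * kThreadsPerWarp;  // partial warps issue as full warps
    return (thread_instructions + sps_per_sm - 1) / sps_per_sm;
}

// Average number of resident threads whose instructions pass through each SP.
inline double threads_served_per_sp(int resident_threads, int sps_per_sm) {
    assert(resident_threads > 0 && sps_per_sm > 0);
    return static_cast<double>(resident_threads) / sps_per_sm;
}
```

For example, with 768 resident threads on an SM with 8 SPs, each SP serves 96 threads on average and one pass over all threads takes 96 clocks in this model. With 32 SPs the same pass takes 24 clocks, which is the sense in which more SPs means more throughput.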

xinwu · April 13, 2011, 5:28am

Thank you so much!
