Newbie confusion: thread, block, multiprocessor and processor

Hi, I’m a newbie.

I know that one block consists of several threads, and that one streaming multiprocessor (SM) consists of (usually) 8 streaming processors (SPs). But I’m confused about how threads and blocks relate to SMs and SPs.

I can think of two possibilities:

(a) one block resides in one SM, and the 8 SPs run all the threads in the block.

(b) one block resides in one SP, and this SP runs all the threads.

I don’t know which is correct.

Moreover, Figure 1-4 in the “NVIDIA CUDA C Programming Guide Version 3.2” (sorry, I don’t know how to insert the figure into the post) illustrates that several blocks can be processed by several cores simultaneously. What exactly do “core” and “block” mean here?

Thank you in advance!

A streaming multiprocessor executes blocks. The scheduler sends blocks to available SMs for processing. Once a block starts on an SM, it must run to completion on that SM: blocks cannot be suspended and cannot be migrated to other SMs. If resource usage (such as shared memory and registers) allows, multiple blocks can be sent to the same SM for execution.
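To make that concrete, here is a minimal sketch (the kernel name `scale` and its parameters are just made up for illustration). The launch configuration only says how many blocks and how many threads per block you want; the hardware decides which SM each block lands on.

```
#include <cuda_runtime.h>

// Hypothetical example kernel: each thread scales one element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                              // 8 warps per block
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // The runtime queues all numBlocks blocks; the GPU distributes them
    // to SMs as resources (registers, shared memory, warp slots) allow.
    scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```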

Inside an SM, there is another scheduler that issues warp-level instructions. Each block is composed of 1 or more warps. When a warp is available for execution (i.e., not waiting on memory reads or a synchronization barrier), the scheduler can issue the next instruction for that warp to some number of streaming processors (SPs, or as NVIDIA now calls them, “CUDA cores”). On compute capability 1.x, the warp was processed by all 8 SPs. On compute capability 2.0, the scheduler issues the next instruction for two different warps every clock, and each warp is processed by 16 SPs (32 total on the SM). Compute capability 2.1 does the same as 2.0, but can issue one additional instruction from a warp to another 16 SPs (48 total on the SM).
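If it helps, here is a small hypothetical kernel (the name `warp_info` is mine, not from the guide) showing how the threads of a block are grouped into warps of 32 consecutive threads, using the built-in `warpSize` constant:

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reports which warp of its block it belongs to:
// threads 0-31 form warp 0, threads 32-63 form warp 1, and so on.
__global__ void warp_info(int *warpIdOut)
{
    int tid = threadIdx.x;
    warpIdOut[tid] = tid / warpSize;   // warpSize is the built-in constant 32
}

int main()
{
    const int threadsPerBlock = 64;    // this block contains 2 warps
    int *d_out, h_out[threadsPerBlock];
    cudaMalloc(&d_out, threadsPerBlock * sizeof(int));

    warp_info<<<1, threadsPerBlock>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int i = 0; i < threadsPerBlock; ++i)
        printf("thread %2d -> warp %d\n", i, h_out[i]);

    cudaFree(d_out);
    return 0;
}
```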

The number of SPs in an SM determines the maximum possible instruction throughput for the SM. However, there is not a one-to-one mapping between a thread and an SP. SPs are just computation engines that process whatever instructions are pushed into their pipelines, and those instructions will in general come from many different threads. If you have more SPs, you can process more thread instructions at the same time, but all of those instructions will come from blocks that have been assigned to the parent SM.
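You can query the per-device numbers with the runtime API. This sketch (device 0 is assumed) uses `cudaGetDeviceProperties` to print the SM count and warp size; the SP/core count per SM is not reported directly but follows from the compute capability as described above.

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    printf("Device:             %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Number of SMs:      %d\n", prop.multiProcessorCount);
    printf("Warp size:          %d\n", prop.warpSize);
    // SPs (CUDA cores) per SM are not a field of cudaDeviceProp; per the
    // discussion above it is 8 for 1.x, 32 for 2.0 and 48 for 2.1.
    return 0;
}
```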


Thank you so much!