Hello, NVIDIA experts,
I cannot understand this term: “Active Thread Blocks per Multiprocessor”. I found it in CUDA_Occupancy_Calculator.xls, as follows:
I don’t know how to interpret it.
I think there is only one active thread block at a time on one SM.
At least, I don’t think an SM can issue warps from different thread blocks at the same time. Is that right?
How should I understand “active”?
Or can several blocks be queued into warp slots, with only one block’s warps issued by the warp scheduler? I’m not sure.
Not correct.
Let’s do a quick thought experiment. We know that CUDA threadblocks are limited to 1024 threads. How then could we ever achieve 1536 “active threads per multiprocessor” if only 1024 threads can be deposited at a time?
CUDA GPU SMs can have multiple threadblocks resident or “active”. The warps of every “active” threadblock are distributed amongst the SMSPs (SM Sub-Partitions) each of which has a warp scheduler, and the warp scheduler can choose from any available warps, from any of the “active” or resident threadblocks, to schedule instructions on the SMSP.
Also not correct, as already discussed. Warps (that are not stalled for some reason) from any active threadblocks are available for the warp schedulers to issue.
Each SM design has a hardware limit on the maximum number of threadblocks that can be resident/deposited/“active”. This limit is specified in table 15 of the programming guide, in the row “Maximum number of resident blocks per SM”.
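You can also read these limits at runtime instead of looking them up in the table. A minimal sketch using the CUDA runtime API (requires a CUDA-capable GPU; queries device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // Hardware ceilings per SM for this device. The occupancy a particular
    // kernel actually achieves can be lower, due to its register and
    // shared-memory usage.
    printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident blocks  per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}
```

For a specific kernel, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many of its blocks can actually be resident at once, taking register and shared-memory limits into account.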
The programming guide also has a section about the hardware side of CUDA, which may be worth a read.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution.
The execution context (program counters, registers, and so on) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
Thank you.
Thanks for your clear description.