Hello, NVIDIA experts,
I cannot understand this term: “Active Thread Blocks per Multiprocessor”. I found it in CUDA_Occupancy_Calculator.xls, as follows:
I don’t know how to interpret it.
I think there is only one active thread block at a time on one SM.
At least, I don’t think an SM can issue warps from different thread blocks at the same time. Is that right?
How should I understand “active”?
Or can several blocks be queued into warp slots, with only one block’s warps issued by the warp scheduler? I’m not sure.
Not correct.
Let’s do a quick thought experiment. We know that CUDA threadblocks are limited to 1024 threads. How then could we ever achieve 1536 “active threads per multiprocessor” if only 1024 threads can be deposited at a time?
CUDA GPU SMs can have multiple threadblocks resident or “active”. The warps of every “active” threadblock are distributed amongst the SMSPs (SM Sub-Partitions) each of which has a warp scheduler, and the warp scheduler can choose from any available warps, from any of the “active” or resident threadblocks, to schedule instructions on the SMSP.
Also not correct, as already discussed. Warps (that are not stalled for some reason) from any active threadblocks are available for the warp schedulers to issue.
Each SM design has a hardware limit on the maximum number of threadblocks that can be resident/deposited/“active”. This limit is specified in table 15 of the programming guide, in the row “Maximum number of resident blocks per SM”.
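You can also read these limits at runtime instead of looking them up in the table. A minimal sketch using the CUDA runtime API (requires a CUDA-capable GPU; queries device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // Hardware ceilings per SM for this device. The occupancy a particular
    // kernel actually achieves can be lower, due to its register and
    // shared-memory usage.
    printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident blocks  per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}
```

For a specific kernel, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many of its blocks can actually be resident at once, taking register and shared-memory limits into account.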
The programming guide also has a section about the hardware side of CUDA, which may be worth a read.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution.
The execution context (program counters, registers, and so on) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
Thank you.
Thanks for your clear description.