How do the 16 INT cores in an SM processing block execute an integer instruction for a 32-thread warp?

e_qiao_liang · July 14, 2023, 4:26am

The Turing SM is partitioned into four processing blocks, as depicted in the image.
I am confused about two questions:

1. Is one warp scheduled on a single processing block within one SM?
2. If the answer to question 1 is yes, what does the hardware do when a warp computes 32 integers? There are only 16 INT cores in a processing block. Do the other 16 FP32 cores stay idle, and are the 32 integers computed by running the 16 INT cores twice?

[Screenshot: Turing SM block diagram showing the four processing blocks]

Robert_Crovella · July 14, 2023, 1:45pm

Yes. When a threadblock is deposited on an SM by the CWD/block scheduler, the warps in that threadblock are statically assigned to SMSPs (SM sub-partitions). Each sub-partition has a single warp scheduler, so this is equivalent to saying the warps are statically assigned to the warp schedulers. If there is only one warp scheduler, all warps will be assigned to it. If there are two warp schedulers, about half of the warps will be assigned to one (assuming the SM is initially empty) and about half to the other. If there are 4 warp schedulers in the SM, and assuming an initially empty SM, the warps will be distributed approximately 1/4 to each warp scheduler. Certain functional unit resources in an SM are also partitioned between the SMSPs. So an SM with 64 “cuda cores” and 4 warp schedulers means that each SMSP/warp scheduler actually has only 16 “cuda cores” to use or assign instructions to.
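A minimal Python sketch of the static assignment described above. The round-robin policy (`warp % num_schedulers`) and the helper names `assign_warps` and `cores_per_smsp` are hypothetical: the exact placement rule is not documented, and round-robin is used only because it reproduces the "roughly 1/N of the warps per scheduler" outcome for an initially empty SM.

```python
# HYPOTHETICAL model: the real warp-to-SMSP placement policy is not
# published; round-robin by warp index is an assumption for illustration.

def assign_warps(num_warps: int, num_schedulers: int) -> list[list[int]]:
    """Statically assign warp indices to SM sub-partitions (SMSPs)."""
    smsps: list[list[int]] = [[] for _ in range(num_schedulers)]
    for warp in range(num_warps):
        # Each warp stays on this SMSP for its whole lifetime.
        smsps[warp % num_schedulers].append(warp)
    return smsps


def cores_per_smsp(cores_per_sm: int, num_schedulers: int) -> int:
    """Functional units are partitioned evenly between the SMSPs."""
    return cores_per_sm // num_schedulers


# A 256-thread block is 8 warps; on a 4-scheduler SM each SMSP gets 2.
layout = assign_warps(num_warps=256 // 32, num_schedulers=4)
print(layout)                 # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(cores_per_smsp(64, 4))  # 16
```

With 4 schedulers and 64 cores, each SMSP in this model ends up with 2 of the block's 8 warps and 16 cores, which matches the 16 INT cores per processing block in the question.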

A warp scheduler always schedules (i.e. issues) instructions warp-wide. Any time a warp scheduler needs to schedule an instruction for which fewer than 32 of the corresponding functional units are available, it will schedule that instruction over multiple clock cycles. If there are 16 units available, it will take 2 cycles; with 8 units, 4 cycles; with 4 units, 8 cycles; and with 2 units (as would be the case for an FP64 instruction), 16 cycles to schedule that instruction.
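The cycle counts above follow from dividing the warp width (32 threads) by the number of available units. A simplified Python model; the name `issue_cycles` is hypothetical, and it models issue cycles only, not pipeline latency or any other timing effect.

```python
WARP_SIZE = 32


def issue_cycles(units_available: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles a scheduler needs to issue one warp-wide instruction when only
    `units_available` matching functional units exist (simplified model)."""
    if units_available <= 0:
        raise ValueError("need at least one functional unit")
    # Ceiling division: each cycle starts at most `units_available` threads.
    return -(-warp_size // units_available)


for units in (32, 16, 8, 4, 2):
    print(units, "units ->", issue_cycles(units), "cycle(s)")
```

For the power-of-two unit counts in the post this gives 1, 2, 4, 8 and 16 cycles respectively.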

e_qiao_liang · July 17, 2023, 5:56am

Got it, thx!!!

spring_wind · September 28, 2023, 10:12am

As you said, for a warp with 32 active threads where only 16 units are available, the instruction will take 2 cycles. Then is it true that threads 0 to 15 will be executed in cycle 0 and threads 16 to 31 will be executed in cycle 1?

Robert_Crovella · September 28, 2023, 2:10pm

Yes, something like that. I don’t know that the detailed behavior is well specified or published, but AFAIK the low-level behavior is that in the first cycle the 16 units begin processing 16 threads, and in the next cycle the 16 units begin processing the next 16 threads: 2 cycles. I don’t know which threads go in which cycle, or how decisions are made about that.
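To make the in-order split proposed in the question concrete, here is a Python sketch that assumes lanes are taken consecutively, 16 per cycle. This lane-to-cycle mapping is purely an assumption (the reply states it is not published), and `lane_groups` is a hypothetical helper.

```python
def lane_groups(units_available: int, warp_size: int = 32) -> list[range]:
    """HYPOTHETICAL: split a warp's lanes into consecutive per-cycle groups.
    The real lane-to-cycle mapping is not published; in-order is assumed."""
    return [range(start, min(start + units_available, warp_size))
            for start in range(0, warp_size, units_available)]


for cycle, lanes in enumerate(lane_groups(16)):
    print(f"cycle {cycle}: threads {lanes.start}-{lanes.stop - 1}")
```

Under this assumption, 16 units give two groups (threads 0-15, then 16-31); only the group count of 2 is supported by the thread, not which threads land in which cycle.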
