The Turing SM is partitioned into four processing blocks, as depicted in the image.
So I am confused about two questions:
Is one warp scheduled on a single processing block within an SM?
If the answer to the first question is yes, what does the hardware do when a warp computes 32 integers? There are only 16 INT cores in a processing block, so do the 16 FP cores sit idle while the 32 integer operations are computed by running the 16 INT cores two times?
Yes. When a threadblock is deposited on a SM by the CWD/block scheduler, the warps in that threadblock are statically assigned to SMSPs (SM sub-partitions). Each sub-partition has a single warp scheduler, so this is like saying the warps are statically assigned to each of the warp schedulers. If there is only one warp scheduler, all warps will be assigned to that. If there are two warp schedulers, about half of the warps will be assigned to one (assuming the SM is empty) and about half will be assigned to the other. If there are 4 warp schedulers in the SM, and assuming an initially empty SM, then the warps will be distributed approximately 1/4 to each warp scheduler. Certain functional unit resources in a SM are also partitioned between the SMSPs. So a SM with 64 “cuda cores” and 4 warp schedulers means that each SMSP/warp scheduler actually only has 16 “cuda cores” to use or assign instructions to.
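To make the arithmetic above concrete, here is a minimal sketch in Python. It is only a model of the description, not of the hardware: the function names are made up for illustration, and the round-robin assignment is an assumption, since the actual policy for placing warps on SMSPs is not stated here.

```python
def warps_in_block(threads_per_block: int, warp_size: int = 32) -> int:
    """Number of warps needed to hold a threadblock (rounded up)."""
    return -(-threads_per_block // warp_size)


def distribute_warps(num_warps: int, num_schedulers: int) -> list[int]:
    """Count of warps per warp scheduler on an initially empty SM.

    ASSUMPTION: warps are handed out round-robin. The real assignment
    policy is not documented in this thread; this only reproduces the
    "approximately 1/N each" behavior.
    """
    counts = [0] * num_schedulers
    for warp_id in range(num_warps):
        counts[warp_id % num_schedulers] += 1
    return counts


def cores_per_partition(cores_per_sm: int, num_schedulers: int) -> int:
    """Functional units each SMSP can issue to, if split evenly."""
    return cores_per_sm // num_schedulers


# A 256-thread block is 8 warps; with 4 schedulers each gets 2 warps.
print(distribute_warps(warps_in_block(256), 4))
# 64 "cuda cores" split across 4 SMSPs gives 16 per SMSP.
print(cores_per_partition(64, 4))
```

With 10 warps and 4 schedulers the same model gives an uneven 3/3/2/2 split, which is what "approximately 1/4" means in practice.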
A warp scheduler always schedules (i.e. issues) instructions warp-wide. Any time a warp scheduler needs to schedule an instruction for which there are fewer than 32 of the corresponding supporting functional units available, the warp scheduler will schedule that instruction over multiple clock cycles. If there are 16 units available, it will take 2 cycles. If there are 8 units available, it will take 4 cycles. If there are 4 units available, it will take 8 cycles, and if there are 2 units available (such as would be the case for an FP64 instruction) it will take 16 cycles to schedule that instruction.
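The cycle counts listed above all follow from dividing the warp width by the number of available units. A small Python sketch of that rule, assuming a 32-thread warp and one thread per unit per cycle (`issue_cycles` is a hypothetical helper name, not an NVIDIA API):

```python
import math

WARP_SIZE = 32  # threads per warp


def issue_cycles(units_available: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to issue one warp-wide instruction.

    ASSUMPTION: each functional unit accepts one thread's operation per
    cycle, so the warp is issued in ceil(warp_size / units) passes.
    """
    if units_available <= 0:
        raise ValueError("units_available must be positive")
    return math.ceil(warp_size / units_available)


for units in (32, 16, 8, 4, 2):
    print(f"{units:2d} units -> {issue_cycles(units):2d} cycle(s)")
```

This only models issue cost; it says nothing about instruction latency or about which threads are handled in which pass.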
As you said, for a warp with 32 active threads where only 16 units are available, it will take 2 cycles. Is it then true that thread#0~thread#15 will be executed at cycle#0 and thread#16~thread#31 will be executed at cycle#1?
Yes, something like that. I don’t know that the detailed behavior is well specified or published, but AFAIK the low level behavior is that in the first cycle, the 16 units will begin processing 16 threads, and in the next cycle the 16 units will begin processing the next 16 threads. 2 cycles. I don’t know which threads go in which cycle, or how decisions are made about that.