The Turing SM is partitioned into four processing blocks depicted as the image.
So I am confused on two questinos:
One warp is scheduled on a processing block in one SM?
If No.1 is yes, then what the hardware do when 32 integers is calculated in a warp? because there is only 16 int cores in a processing block , and the other 16 float cores keep idle? and compute 32 integers by executing the 16 int cores tow times?
Yes. When a threadblock is deposited on a SM by the CWD/block scheduler, the warps in that threadblock are statically assigned to SMSPs (SM sub-partitions). Each sub-partition has a single warp scheduler, so this is like saying the warps are statically assigned to each of the warp schedulers. If there is only one warp scheduler, all warps will be assigned to that. If there are two warp schedulers, about half of the warps will be assigned to one (assuming the SM is empty) and about half will be assigned to the other. If there are 4 warp schedulers in the SM, and assuming an initially empty SM, then the warps will be distributed approximately 1/4 to each warp scheduler. Certain functional unit resources in a SM are also partitioned between the SMSPs. So a SM with 64 “cuda cores” and 4 warp schedulers means that each SMSP/warp scheduler actually only has 16 “cuda cores” to use or assign instructions to.
A warp scheduler always schedules (i.e. issues) instructions warp-wide. Any time a warp scheduler needs to schedule an instruction for which there are less than 32 of the corresponding supporting functional units available, the warp scheduler will schedule that instruction over multiple clock cycles. If there are 16 units available, it will take 2 cycles. If there are 8 units available, it will take 4 cycles. If there are 4 units available, it will take 8 cycles, and if there are 2 units available (such as would be the case for a FP64 instruction) it will take 16 cycles, to schedule that instruction.