Yes. When a threadblock is deposited on an SM by the CWD/block scheduler, the warps in that threadblock are statically assigned to SMSPs (SM sub-partitions). Each sub-partition has a single warp scheduler, so this is equivalent to saying that each warp is statically assigned to one of the warp schedulers. If there is only one warp scheduler, all warps are assigned to it. If there are two warp schedulers, about half of the warps are assigned to one (assuming the SM is initially empty) and about half to the other. If there are 4 warp schedulers in the SM, again assuming an initially empty SM, the warps are distributed approximately 1/4 to each warp scheduler. Certain functional unit resources in an SM are also partitioned among the SMSPs. So an SM with 64 “CUDA cores” and 4 warp schedulers means that each SMSP/warp scheduler actually has only 16 “CUDA cores” to use or assign instructions to.
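To make the arithmetic above concrete, here is a minimal sketch in Python. The function names are made up for illustration, and the round-robin assignment of warps to schedulers is an assumption on my part: the text above only says the split is approximately even, and the exact assignment rule is not something I am claiming is documented.

```python
def cores_per_smsp(cores_per_sm, num_schedulers):
    """Functional units available to each SMSP/warp scheduler."""
    return cores_per_sm // num_schedulers


def warps_per_scheduler(threads_per_block, num_schedulers, warp_size=32):
    """Count of warps landing on each scheduler for one threadblock.

    ASSUMPTION: warps are dealt out round-robin onto an initially
    empty SM. Real hardware may use a different (undocumented) rule;
    this only illustrates the "approximately even" split.
    """
    num_warps = -(-threads_per_block // warp_size)  # ceiling division
    counts = [0] * num_schedulers
    for warp_id in range(num_warps):
        counts[warp_id % num_schedulers] += 1
    return counts


if __name__ == "__main__":
    print(cores_per_smsp(64, 4))        # units per SMSP
    print(warps_per_scheduler(256, 4))  # 8 warps over 4 schedulers
    print(warps_per_scheduler(160, 4))  # 5 warps: split cannot be exact
```

With 64 cores and 4 schedulers this gives 16 units per SMSP. A 256-thread block (8 warps) splits 2/2/2/2, while a 160-thread block (5 warps) can only be approximately even, which is why the description above says "about" and "approximately".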
A warp scheduler always schedules (i.e. issues) instructions warp-wide. Any time a warp scheduler needs to issue an instruction for which fewer than 32 of the corresponding functional units are available, it issues that instruction over multiple clock cycles. If there are 16 units available, it takes 2 cycles. If there are 8 units available, it takes 4 cycles. If there are 4 units available, it takes 8 cycles, and if there are 2 units available (such as would be the case for an FP64 instruction) it takes 16 cycles to issue that instruction.
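The cycle counts listed above are simply the warp width divided by the number of available units. A small sketch, where `issue_cycles` is a name I made up for illustration and the model deliberately ignores everything except unit count:

```python
WARP_SIZE = 32  # threads per warp


def issue_cycles(units_available, warp_size=WARP_SIZE):
    """Cycles needed to issue one warp-wide instruction.

    Simplified model: the warp is fed through the available
    functional units in slices of `units_available` threads per cycle.
    """
    if units_available <= 0:
        raise ValueError("need at least one functional unit")
    return -(-warp_size // units_available)  # ceiling division


if __name__ == "__main__":
    for units in (32, 16, 8, 4, 2):
        print(f"{units:2d} units -> {issue_cycles(units):2d} cycle(s)")
```

This reproduces the 2, 4, 8 and 16 cycle figures for 16, 8, 4 and 2 units respectively, and 1 cycle when a full 32 units are available.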