How is a warp executed on an SM?


Dear all,

In the Turing GPU, an SM contains 64 CUDA cores, and this hardware is split into four processing blocks, each with 16 CUDA cores and one warp scheduler/dispatch unit.

A warp has 32 threads, so how is a ready warp executed on one of these processing blocks? If the next instruction for the warp is a floating-point instruction and all the resources it needs are available, does the warp complete the instruction for all 32 threads over two clock cycles?
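To make my assumption concrete, here is the arithmetic behind my question (the constants are my reading of the Turing whitepaper, not something I have verified):

```python
# Sketch of my understanding of Turing issue timing (assumption, not verified fact).
WARP_SIZE = 32              # threads per warp
FP32_CORES_PER_BLOCK = 16   # FP32 cores per processing block (Turing SM = 4 blocks x 16 cores)

# If a warp-wide FP32 instruction is issued 16 lanes at a time,
# it should take WARP_SIZE / FP32_CORES_PER_BLOCK clock cycles to issue:
issue_cycles = WARP_SIZE // FP32_CORES_PER_BLOCK
print(issue_cycles)  # 2
```

So my question is whether this two-cycle issue per warp instruction is actually what the hardware does.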

By the way, a CUDA textbook gives a rule for an active warp to run: there must be 32 free CUDA cores. In the Turing architecture, a warp scheduler only controls 16 CUDA cores, so does that violate the rule?

Many thanks for any replies.