On a Turing GPU, an SM has 64 CUDA cores. The hardware is split into four portions, and each portion owns 16 CUDA cores plus one warp scheduler and dispatch unit.
A warp has 32 threads, so how is a ready warp executed on one portion? If the warp's next instruction is a floating-point arithmetic instruction and all the resources it needs are available, does the portion execute that instruction for all 32 threads over two clock cycles, 16 threads per cycle?
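To make my arithmetic explicit, here is a rough sketch of how I am counting the cycles. It is only a back-of-the-envelope model, not a description of the real hardware: it assumes each CUDA core accepts one thread's instruction per clock and ignores pipeline latency. The function name `issue_cycles` and the constants are my own, chosen for illustration.

```python
import math

WARP_SIZE = 32          # threads per warp in CUDA
SM_FP32_CORES = 64      # CUDA cores per Turing SM (figure from the post)
PORTIONS_PER_SM = 4     # portions per SM (figure from the post)


def issue_cycles(lanes_per_portion: int, warp_size: int = WARP_SIZE) -> int:
    """Clock cycles needed to issue one instruction for every thread of a warp.

    Assumption: each lane (CUDA core) takes one thread's instruction per clock.
    """
    if lanes_per_portion <= 0:
        raise ValueError("lanes_per_portion must be positive")
    return math.ceil(warp_size / lanes_per_portion)


lanes = SM_FP32_CORES // PORTIONS_PER_SM  # 16 CUDA cores per portion

print(issue_cycles(lanes))  # 16 lanes: 2 cycles under this model
print(issue_cycles(32))     # 32 lanes: 1 cycle under this model
```

Under this simplified model, a portion with 16 CUDA cores needs two cycles per warp instruction, while a hypothetical portion with 32 cores would need one.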
By the way, a CUDA textbook states a rule that an active warp can only run when 32 CUDA cores are free. In the Turing architecture, a warp scheduler controls only 16 CUDA cores. Does this violate that rule?
Many thanks for any replies.