A warp scheduler in an SM has a number of warps assigned to it. It looks at all of those warps to determine which have instructions that are ready to issue, then chooses 1 or 2 of those ready instructions and issues them. Issuing an instruction means assigning (scheduling) functional units within the SM to execute that instruction, warp-wide. A warp is always 32 threads, so to issue the instruction either 32 functional units in one clock cycle, or a smaller number spread across multiple clock cycles, must be scheduled (and therefore must be “available”).
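If it helps, here is a toy sketch of that selection step in Python. It is only an illustration: the real selection policy is unpublished, so the "lowest warp id first" rule, the one-instruction-per-warp simplification, and the names `Warp` and `pick_to_issue` are all my own assumptions, not how the hardware is known to work.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    ready: bool  # True if this warp's next instruction is ready to issue


def pick_to_issue(warps, max_issue=2):
    """Return the ids of the warps chosen to issue in this cycle.

    Assumed policy (hypothetical): take ready warps in ascending id order,
    at most one instruction per warp, up to max_issue per cycle.
    """
    ready_ids = [w.warp_id for w in warps if w.ready]
    return ready_ids[:max_issue]


# Four warps assigned to one scheduler; warp 0 is stalled.
warps = [Warp(0, False), Warp(1, True), Warp(2, True), Warp(3, True)]
chosen = pick_to_issue(warps)
print(chosen)
```

Warp 3 is ready too, but in this model it simply waits for a later cycle because at most 2 instructions are issued per cycle.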
All functional units are pipelined. Many/most can accept a new instruction of the type they are designed to handle, on each clock cycle. The pipeline depth determines when that instruction completes/retires.
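As a rough sketch of why pipelining matters here, the following Python compares a unit that accepts a new instruction every clock cycle against one that has to finish each instruction before starting the next. The depth of 4 is a made-up number purely for illustration (actual pipeline depths are unpublished), and the function names are hypothetical.

```python
def pipelined_total_cycles(num_instructions, depth):
    """Cycles until the last instruction retires, assuming the unit accepts
    one new instruction per clock cycle and each takes `depth` cycles."""
    if num_instructions == 0:
        return 0
    # The first instruction retires after `depth` cycles; each later one
    # retires one cycle after the previous one.
    return depth + (num_instructions - 1)


def unpipelined_total_cycles(num_instructions, depth):
    """Same work if the unit could not start a new instruction until the
    previous one had finished."""
    return num_instructions * depth


# Hypothetical depth of 4 cycles, 10 back-to-back instructions.
print(pipelined_total_cycles(10, 4))    # 13
print(unpipelined_total_cycles(10, 4))  # 40
```

So the depth sets the latency of any one instruction, while the one-per-cycle acceptance rate sets the throughput.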
You’ll need to grasp the idea that an SP refers most directly to a floating-point ALU. It handles floating-point adds and multiplies but, generally speaking, not other instructions. If you have an integer add, for example, an SP would not be scheduled to handle that instruction; an integer ALU would be scheduled instead.
All instructions are issued warp-wide, and require 32 functional units of the appropriate type to be scheduled. This can be 32 functional units in a single clock cycle, or e.g. 16 over 2 clock cycles, or 8 over 4 clock cycles, etc.
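The arithmetic above is just the warp size divided by the number of units of the right type that can be scheduled per clock cycle. A minimal sketch in Python (the name `issue_cycles` is mine, and rounding up for unit counts that don't divide 32 evenly is my assumption, since only 32, 16, and 8 are given as examples):

```python
WARP_SIZE = 32  # a warp is always 32 threads


def issue_cycles(units_per_cycle):
    """Clock cycles needed to issue one warp-wide instruction when
    `units_per_cycle` functional units of the right type are available."""
    if units_per_cycle <= 0:
        raise ValueError("need at least one functional unit")
    # Ceiling division: assumed behavior for counts that don't divide 32.
    return -(-WARP_SIZE // units_per_cycle)


for units in (32, 16, 8):
    print(units, "units ->", issue_cycles(units), "cycle(s)")
```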
For the purpose of this discussion I am ignoring Tensor Core operations.
Most specifics here are unpublished, and I wouldn’t be able to answer questions like this:
- what are all the different types of functional units in an SM?
- how many of functional unit X are in SM architecture Y?
- what is the pipeline depth of functional unit X?
- what is the exact algorithm by which a warp scheduler chooses instructions to issue?
etc.