questions about sp and sm

Robert_Crovella · June 19, 2019, 10:01pm

A warp scheduler in an SM has a number of warps assigned to it. It looks at all of those warps to determine which have instructions that are ready to issue, then chooses 1 or 2 of those ready instructions and issues them. Issuing an instruction means assigning (scheduling) functional units within the SM to execute that instruction warp-wide. A warp is always 32 threads, so issuing the instruction requires either 32 functional units in one clock cycle or a smaller number used across multiple clock cycles, and those units must be “available” to be scheduled.

All functional units are pipelined. Many/most can accept a new instruction of the type they are designed to handle, on each clock cycle. The pipeline depth determines when that instruction completes/retires.
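To make the pipelining point concrete, here is a minimal sketch of the usual pipeline arithmetic. It assumes an idealized unit that accepts one independent instruction per clock cycle with no stalls. The function name `total_cycles` and the depth of 4 are hypothetical, chosen only for illustration, since real pipeline depths are not published.

```python
def total_cycles(n_instructions: int, pipeline_depth: int) -> int:
    """Cycles until the last of n back-to-back instructions completes.

    Idealized model (an assumption, not a description of real hardware):
    one new instruction enters the pipeline per cycle, and each one
    takes `pipeline_depth` cycles to travel through it.
    """
    if n_instructions <= 0 or pipeline_depth <= 0:
        raise ValueError("both arguments must be positive")
    # The first instruction completes after `pipeline_depth` cycles;
    # each later instruction completes one cycle after the previous one.
    return pipeline_depth + (n_instructions - 1)


# Hypothetical depth of 4: a single instruction takes 4 cycles,
# but 10 instructions take 13 cycles rather than 40.
print(total_cycles(1, 4))
print(total_cycles(10, 4))
```

The depth sets the latency of each instruction, while accepting a new instruction every cycle sets the throughput.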

You’ll need to grasp the idea that an SP refers most directly to a floating-point ALU. It handles floating-point adds and multiplies but, generally speaking, not other instructions. If you have an integer add, for example, an SP would not be scheduled to handle that instruction; an integer ALU would be scheduled instead.

All instructions are issued warp wide, and require 32 functional units of the appropriate type to be scheduled. This can be 32 functional units in a single clock cycle, or e.g. 16 over 2 clock cycles, or 8 over 4 clock cycles, etc.
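The arithmetic in that last sentence can be written as a small calculation. This is only a sketch: the function name `issue_cycles` is hypothetical, and the unit counts passed in are the illustrative values from the text, not published figures for any particular SM.

```python
WARP_SIZE = 32  # a warp is always 32 threads


def issue_cycles(units_per_cycle: int, warp_size: int = WARP_SIZE) -> int:
    """Clock cycles needed to issue one warp-wide instruction.

    `units_per_cycle` is how many functional units of the required type
    can be scheduled in a single clock cycle (an illustrative input).
    """
    if units_per_cycle <= 0:
        raise ValueError("units_per_cycle must be positive")
    # Ceiling division: every one of the 32 threads needs a unit slot.
    return -(-warp_size // units_per_cycle)


for units in (32, 16, 8):
    print(f"{units} units -> {issue_cycles(units)} cycle(s)")
```

With 32, 16, and 8 units this gives 1, 2, and 4 cycles, matching the cases listed above.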

For the purpose of this discussion I am ignoring Tensor Core operations.

Most specifics here are unpublished, and I wouldn’t be able to answer questions like this:

what are all the different types of functional units in an SM?
how many of functional unit X are in SM architecture Y?
what is the pipeline depth of functional unit X?
what is the exact algorithm by which a warp scheduler chooses instructions to issue?

etc.
