Questions about SP and SM

Hello everyone, I am confused about GPU HW.

I know an SM can hold many warps, but only one warp can really execute, and the SPs actually run the threads.
Since an SM usually has 8 SPs, that means if a warp runs on one SM, each SP needs to run 4 threads, right? So if an SM has more SPs, like 16, then each SP runs 2 threads?

Another question: in a four-stage pipeline, the SM fetches multiple instructions and sends them to the SPs, the SPs execute the instructions, and then the results are written back. If the SM fetches multiple instructions, do the SPs execute all of them?

Thanks in advance!

Most of your statements are wrong.

More than one warp can execute.
An SP does not run a whole thread. It is a functional unit that executes a particular instruction type.
An SM usually has many more than 8 SPs.
An SP does not run 4 threads. It does not even run one whole thread.

Isn’t this really outdated information from the sm_1x microarchitecture (NVIDIA GeForce 8800 GTX, GeForce 8 series, etc.)? I believe this one had 8 SPs (CUDA cores) per SM, with these shader clusters running at twice the clock speed of the rest of the chip. Per cycle, the SM could execute a “half warp” of 16 threads, so over successive cycles the SM would process one instruction for a full warp of 32 threads. Then the instruction scheduler in the SM would switch to another instruction from one of the warps eligible for execution.

For the current state of things, please have a look at

https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

Here the number of SPs (CUDA cores) per SM is given as 128 or 64, depending on whether we look at the GTX 1080 Ti (Pascal) or the RTX 2080 Ti (Turing).

So how does a warp run on an SM? Do all SPs participate in running a warp? I think the ALU is in the SP, so there is no need for all SPs to run for one warp.

Yes, I just gave an example situation. My point is how the SPs run a warp: are the threads split equally across all SPs, or does only part of the SPs run for a warp?

A warp scheduler in an SM has a number of warps assigned to it. The warp scheduler looks at all warps assigned to it to determine which have instructions that are ready to issue. It then chooses 1 or 2 instructions that are ready to execute, and issues those instructions. Issuing an instruction involves assigning functional units within the SM to the execution of that instruction, warp-wide. A warp is always 32 threads, so 32 functional units in one clock cycle, or a smaller number distributed across multiple clock cycles, must be scheduled (and therefore must be “available”) to issue the instruction.

All functional units are pipelined. Many/most can accept a new instruction of the type they are designed to handle on each clock cycle. The pipeline depth determines when that instruction completes/retires.

You’ll need to grasp the idea that an SP refers most directly to a floating-point ALU. It handles floating-point adds and multiplies but, generally speaking, not other instructions. If you have an integer add, for example, an SP would not be scheduled to handle that instruction; an integer ALU would be.

All instructions are issued warp wide, and require 32 functional units of the appropriate type to be scheduled. This can be 32 functional units in a single clock cycle, or e.g. 16 over 2 clock cycles, or 8 over 4 clock cycles, etc.

For the purpose of this discussion I am ignoring tensorcore operations.

Most specifics here are unpublished, and I wouldn’t be able to answer questions like this:

  • what are all the different types of functional units in a SM?
  • how many of functional unit X are in an SM of architecture Y?
  • what is the pipeline depth of functional unit X?
  • what is the exact algorithm by which a warp scheduler chooses instructions to issue?

etc.
