Questions about SP and SM

Hello everyone, I am confused about GPU HW.

I know an SM can hold many warps, but only one warp can really execute, and the SPs actually run the threads.
Since an SM usually has 8 SPs, that means if a warp runs on one SM, each SP needs to run 4 threads, right? So if an SM has more SPs, like 16, then each SP runs 2 threads?

Another question: in a four-stage pipeline, the SM fetches multiple instructions and sends them to the SPs, the SPs execute the instructions, and then the results are written back. If the SM fetches multiple instructions, do the SPs execute all of them?

Thanks in advance!

Most of your statements are wrong.

More than one warp can execute.
An SP does not run a whole thread. It is a functional unit that executes a particular instruction type.
An SM usually has many more than 8 SPs.
An SP does not run 4 threads. It does not even run one whole thread.

Isn’t this really outdated information from the sm_1x microarchitecture (NVIDIA GeForce 8800 GTX, GeForce 8 series, etc.)? I believe this one had 8 SPs (CUDA cores) per SM, with these shader clusters running at twice the clock speed of the rest of the chip. Per cycle, the SM could execute a “half warp” of 16 threads, so over successive cycles the SM would process one instruction for a full warp of 32 threads. Then the instruction scheduler in the SM would switch to another instruction from one of the warps eligible for execution.

For the current state of things, please have a look at

https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

Here the number of SPs (CUDA cores) per SM is given as 128 or 64, depending on whether we look at the GTX 1080 Ti (Pascal) or the RTX 2080 Ti (Turing).

So how does a warp run on an SM? Do all SPs participate in running a warp? I think the ALU is in the SP, so there is no need for all SPs to run for one warp.

Yes, I just gave an example situation. My point is how the SPs run a warp: are the threads split equally across all SPs, or does only part of the SPs run for a warp?

A warp scheduler in an SM has a number of warps assigned to it. The warp scheduler looks at all warps assigned to it to determine which have instructions that are ready to issue. It then chooses 1 or 2 instructions that are ready to execute, and issues those instructions. Issuing an instruction involves assigning functional units within the SM to the execution of that instruction, warp-wide. A warp is always 32 threads, so 32 functional units in one clock cycle, or a smaller number distributed across multiple clock cycles, must be scheduled (and therefore must be “available”) to issue the instruction.

All functional units are pipelined. Many/most can accept a new instruction of the type they are designed to handle on each clock cycle. The pipeline depth determines when that instruction completes/retires.

You’ll need to grasp the idea that an SP refers most directly to a floating-point ALU. It handles floating-point adds and multiplies but, generally speaking, not other instructions. If you have an integer add, for example, an SP would not be scheduled to handle that instruction; an integer ALU would be.

All instructions are issued warp wide, and require 32 functional units of the appropriate type to be scheduled. This can be 32 functional units in a single clock cycle, or e.g. 16 over 2 clock cycles, or 8 over 4 clock cycles, etc.

For the purpose of this discussion I am ignoring tensorcore operations.

Most specifics here are unpublished, and I wouldn’t be able to answer questions like this:

  • what are all the different types of functional units in a SM?
  • how many of functional unit X are in an SM of architecture Y?
  • what is the pipeline depth of functional unit X?
  • what is the exact algorithm by which a warp scheduler chooses instructions to issue?

etc.
