A CUDA core is a pipelined math unit (aka datapath) that executes an FP32 fused multiply-add (FMA).
The SM comprises numerous instruction pipelines (aka datapaths, aka execution units) for FP32, FP64, FP16, INT, Tensor (matrix multiply-accumulate), bit manipulation, logical operations, data movement, control flow (branching, barriers), and memory access.
On Volta through Hopper, each SM has four sub-partitions (called partitions in the latest whitepapers). Each sub-partition has a warp scheduler and dedicated instruction pipelines for FP32, FP16, INT, data conversion, special functions, uniform datapath (Turing+), and Tensor operations. The 100-class SMs may also have additional FP64 instruction pipelines.
The warp scheduler can dispatch instructions to the instruction pipelines in its sub-partition, or dispatch them to the MIO (memory input/output) unit. The MIO unit is responsible for queuing and dispatching instructions to SM-shared instruction pipelines/execution units, including the LSU (load store unit), TEX (texture unit), IDC (indexed constant cache), CBU (control and branch unit), … On GPUs with lighter SMs (10x/20x consumer parts) the FP64 unit is also an SM-shared unit.
Publicly available microarchitecture detail on each SM is very limited, and SM architecture can vary significantly between generations. The Pipelines section of the Kernel Profiling Guide (Nsight Compute 12.6 documentation) contains a list of the common instruction pipelines exposed by the profiler and the types of instructions supported by each pipeline.
On Volta through GA100 (but not GA10x), the shared dispatch pipes in each SM sub-partition include:
- fp16ultra (FP16x2 HF*2)
- tensor pipes (integer, floating point)
- on 100-class parts, fp64lite for DFMA/DADD/DMUL/DSETP
On GA10x through GH100, FP16x2 is handled by the 2x FMA pipes (fmalite and fmaheavy).