A CUDA core is a pipelined math unit (aka datapath) that executes an FP32 fused multiply-add (FMA).
The SM comprises numerous instruction pipelines (aka datapaths, aka execution units) for FP32, FP64, FP16, INT, Tensor (matrix multiply-accumulate), bit manipulation, logical operations, data movement, control flow (branching, barriers), and memory access.
On Volta through Hopper, each SM has four sub-partitions (called partitions in the latest whitepapers). Each sub-partition has a warp scheduler and dedicated instruction pipelines for FP32, FP16, INT, data conversion, special functions, uniform datapath (Turing+), and Tensor operations. The 100-class SMs may also have additional FP64 instruction pipelines.
The warp scheduler can dispatch instructions to the instruction pipelines in its sub-partition, or dispatch them to the MIO (memory input/output) unit. The MIO unit is responsible for queuing and dispatching instructions to SM-shared instruction pipelines/execution units, including the LSU (load store unit), TEX (texture unit), IDC (indexed constant cache), CBU (control and branch unit), … On GPUs with lighter SMs (10x/20x consumer parts) the FP64 unit is also an SM-shared unit.
Publicly available microarchitecture detail on each SM is very limited, and SM architecture can vary significantly between generations. The Pipelines section of the Kernel Profiling Guide (Nsight Compute 12.6 documentation) contains a list of the common instruction pipelines exposed by the profiler and the types of instructions supported by each pipeline.
On Volta through GA100 (but not GA10x), the shared dispatch pipes in each SM sub-partition include:
- fp16ultra (FP16x2 HF*2)
- tensor pipes (integer, floating point)
- on 100-class parts, fp64lite for DFMA/DADD/DMUL/DSETP
On GA10x through GH100, FP16x2 is handled by the 2x FMA pipes (fmalite and fmaheavy).