If you are trying to understand the microarchitecture of the SM, the first step is to stop using the terms CUDA core and Tensor core: these are marketing terms that convey a rough order of magnitude of performance, not actual hardware units.
At a high level each SM consists of the following hierarchy (this can change slightly between architectures):
- SM Core
- High level warp scheduler that is concerned with overall scheduling policy.
- SM Sub-partitions (4)
- Warp scheduler that is responsible for cycle by cycle scheduling.
- General register file (allocated per warp across all 32 lanes)
- Uniform register file (allocated per warp)
- Fixed latency math pipelines (fma, fmaheavy, fmalite, fp64, alu, mma (tensor), uniform)
- Variable/long latency math pipelines (xu/mufu/sfu)
- Dispatch port for variable latency math and memory instructions.
- Immediate constant cache
- MIO/L1TEX
- Instructions are enqueued from the SM sub-partitions into instruction queues and the MIO dispatches them to shared variable latency units such as
- fp64 pipeline on consumer cards
- idc - indexed constant cache
- lsu - load store unit (global, local, shared, dsmem)
- texture units (tex, surface)
- tensor memory accelerator - dma controller
On each cycle the SM sub-partition warp schedulers independently select an active, non-stalled warp to issue an instruction. The instruction can be dispatched to one of the fixed latency math pipes or, after collecting operands, enqueued into an instruction queue in the MIO. Some pipelines such as fma* and alu are not 32 lanes wide, so a warp may be selected once and dispatched over multiple cycles.
On each cycle the MIO unit selects instructions from various shared instruction queues including LSU, TEX, branch unit, and shared math units and dispatches the instruction if the pipeline is available.
A given SM sub-partition warp scheduler can select an FFMA instruction from any warp and issue it to the fma* pipe. On the next cycle the scheduler can select an HMMA instruction and issue it to the mma pipe. Dispatch can take multiple cycles. Each of these pipelines is many cycles deep, so these instructions are “in flight” or “executing” on the same cycle.
The maximum sustained instruction issue rate on Volta - Hopper is 1 instruction per cycle per SM sub-partition. The number of instructions in flight can be extremely high due to the depth of some of the pipelines and the number of outstanding memory operations from the LSU/TEX to the L1TEX unit.
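To make this concrete, here is a minimal kernel of my own (not from any NVIDIA sample) that mixes wmma Tensor Core work with independent scalar FP32 FMAs. Compiled for sm_70 or newer, the wmma calls become HMMA instructions and the scalar loop becomes FFMA instructions, so the warp schedulers can keep both the mma and fma* pipes busy on the same cycles.

```cpp
// Sketch: HMMA (tensor pipe) and FFMA (fma pipe) work in one kernel.
// Launch with one warp (32 threads); sizes are the 16x16x16 wmma shape.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void mma_plus_ffma(const half* A, const half* B, float* C,
                              const float* x, float* y) {
    // Tensor pipe work: one 16x16x16 matrix multiply-accumulate per warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);

    // fma pipe work: independent scalar FP32 FMAs the scheduler can
    // interleave with the HMMAs above (compiles to FFMA in SASS).
    float acc = x[threadIdx.x];
    #pragma unroll
    for (int i = 0; i < 16; ++i)
        acc = fmaf(acc, 1.000001f, 0.5f);
    y[threadIdx.x] = acc;

    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```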
Can you interleave FP16x2 and MMA?
On Volta - GA100 (not GA10x) the FP16x2 math pipeline shares the same dispatch port as the MMA pipelines so there can be contention. Whether it uses the exact same gates or just the wires to read and write the registers is irrelevant to the developer and it changes often.
On GA10x+ with 2x FP32 the fma pipe was split into the fmaheavy pipe (fp32 + imad + a few other instructions) and the fmalite pipe (fp32). fp16x2 vectors are split and sent down both fma pipes simultaneously.
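For reference, FP16x2 means packed-half arithmetic like the minimal kernel below (my own illustration): __hfma2 performs two FP16 FMAs per thread and compiles to a single HFMA2 SASS instruction, which is the case that gets split across the two fma pipes on GA10x+.

```cpp
// Minimal FP16x2 example: __hfma2 does two FP16 FMAs per thread and
// compiles to a single HFMA2 instruction.
#include <cuda_fp16.h>

__global__ void hfma2_kernel(const __half2* a, const __half2* b,
                             const __half2* c, __half2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // out = a * b + c, elementwise
}
```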
Does this help you as a programmer?
Not really. If you know the instruction issue rate, the only reasons to care about which pipeline an instruction uses are that (a) specific instructions can only be issued to one pipe (e.g. imad only goes down fmaheavy), or (b) you are using the profiler and need to understand, when a pipe's utilization is high, which instructions are consuming that pipeline.
Can you improve performance by interleaving MMA and FP16x2?
Potentially, but in my opinion even if you have full access to the assembler and know every detail it is really hard to optimize this by hand. The better method is to use multiple warps and allow the warp scheduler to interleave instructions to the available pipes. CUTLASS warp specialization is a technique that can achieve speed of light for a pipeline by minimizing the number of warps using a pipe, so that the pipe can be driven at maximum rate without runtime conflicts such as register file arbitration conflicts (stall_dispatch). On Volta - Ada the MMA unit input and output is the register file. A warp has to (a) load data into the register file, (b) issue MMAs, (c) process the result, and (d) write data out. Speed of light is often best achieved by multi-buffering this pipeline across warps such that no two warps are in the same stage of that pipeline at the same time. Alternatively, using shared memory barriers and the asynchronous pipeline, warps can be assigned specific roles - load data to shared memory, issue MMA instructions, process the results, etc.
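A heavily simplified sketch of the role-per-warp idea follows. This is my own illustration, not CUTLASS code: it uses a plain __syncthreads() handoff instead of the asynchronous barriers/cp.async/TMA a real warp-specialized kernel would use, and the tile size and warp counts are arbitrary.

```cpp
// Heavily simplified warp-specialization sketch (not CUTLASS code).
// Producer warps stage a tile into shared memory, consumer warps do the math.
// Launch with (LOAD_WARPS + MATH_WARPS) * 32 = 128 threads per block.
#include <cuda_runtime.h>

constexpr int TILE       = 256;  // elements per tile (arbitrary)
constexpr int LOAD_WARPS = 1;    // warps assigned the "load" role
constexpr int MATH_WARPS = 3;    // warps assigned the "math" role

__global__ void warp_specialized_sketch(const float* in, float* out, int n_tiles) {
    __shared__ float tile[TILE];
    const int warp_id = threadIdx.x / 32;
    const int lane    = threadIdx.x % 32;

    for (int t = 0; t < n_tiles; ++t) {
        if (warp_id < LOAD_WARPS) {
            // Producer role: copy one tile from global to shared memory.
            for (int i = lane; i < TILE; i += LOAD_WARPS * 32)
                tile[i] = in[t * TILE + i];
        }
        __syncthreads();  // hand the tile to the math warps
        if (warp_id >= LOAD_WARPS) {
            // Consumer role: stand-in for the MMA / epilogue work.
            for (int i = (warp_id - LOAD_WARPS) * 32 + lane; i < TILE; i += MATH_WARPS * 32)
                out[t * TILE + i] = tile[i] * 2.0f;
        }
        __syncthreads();  // make the shared buffer safe to overwrite
    }
}
```

In a real kernel the roles run ahead of each other over multiple shared memory buffers so that the LSU/TMA and MMA pipes stay busy simultaneously rather than alternating.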
If you are trying to get the highest performance out of the NVIDIA GPU, especially for Tensor Cores, I recommend reviewing CUTLASS. In most cases if an accelerated library for your domain exists then use the library unless you find a performance issue or an edge case that is not well supported.
Do NVIDIA GPUs support vector operations?
Yes. The list of hardware vector instructions changes between architectures and between chips within an architecture. The CUDA compiler and PTX support vector abstractions that may map to a single SASS instruction or may be emulated as a sequence of instructions. SASS vector instructions start with V (e.g. VIADD) or end with the vector size (e.g. HFMA2 = 2xFP16 FMA).
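A small example of my own that shows both cases: the __half2 add compiles to a single packed HADD2, while the float4 arithmetic is lowered to four scalar FADDs even though its 128-bit loads and stores are single instructions.

```cpp
// Vector abstractions in CUDA C++: __half2 math maps to one packed SASS
// instruction (HADD2), while float4 arithmetic is emulated as a sequence
// of scalar FADDs even though the 128-bit load/store is one instruction.
#include <cuda_fp16.h>

__global__ void vector_abstractions(const __half2* ha, const __half2* hb, __half2* hout,
                                    const float4* fa, const float4* fb, float4* fout, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    hout[i] = __hadd2(ha[i], hb[i]);   // one HADD2: two FP16 adds per thread

    float4 a = fa[i], b = fb[i];       // one 128-bit load each
    fout[i] = make_float4(a.x + b.x, a.y + b.y,   // four scalar FADDs
                          a.z + b.z, a.w + b.w);  // (no packed FP32 math pipe)
}
```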
Learning the Microarchitecture
If you want to learn the details of the microarchitecture there are a few really good methods:
- Write CUDA code and review the PTX and SASS. This can be done in godbolt (Compiler Explorer). Once you have written the code you can manipulate the compiler settings to compile for different compute and sm architectures to see how architecture changes (e.g. 2xFP32 in GA10x) impact the instruction schedule.
- Read the PTX manual.
- Write small 5-10 line microbenchmark kernels and look at the SASS and profiler output (see the sketch after this list).
- Read professional and academic papers that break down the GPU.
- Single step optimized SASS code (no -G) in the debugger. This is a great way to learn SASS and concepts such as lane predication, divergent branching, barriers, etc.
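As an example of the microbenchmark approach, here is a tiny kernel of my own with a dependent FFMA chain; the commands for dumping the SASS are in the comments, or paste the kernel into godbolt and select nvcc.

```cpp
// Microbenchmark sketch: a dependent FFMA chain exposes the fma pipe latency
// in the SASS schedule and in the profiler. Inspect the SASS with, e.g.:
//   nvcc -arch=sm_86 -cubin microbench.cu -o microbench.cubin
//   cuobjdump -sass microbench.cubin
__global__ void fma_chain(float* out, float seed, int iters) {
    float a = seed;
    for (int i = 0; i < iters; ++i)
        a = fmaf(a, 1.000001f, 0.000001f);  // each FFMA depends on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}
```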