I know there are a couple of topics out there on this issue already, but I’m still trying to get a clear picture of CUDA cores/functional units and pipelines and how they relate to each other. I’m trying to identify ways to further improve the utilization of my GPUs.
I’m aware that some of the details below are architecture-dependent, but I want to understand the general concept, which I believe should remain fairly consistent across architectures.
According to [1], “CUDA cores” is a marketing term that helps to convey a magnitude of performance. My understanding is that the INT32, FP32, and FP64 cores often depicted in the whitepapers of the different architectures (e.g. Figure 7 in [2]) don’t actually physically exist on the GPU. Instead, these “CUDA cores” are made up of some physical “functional units” (which NVIDIA does not disclose) that actually execute additions, multiplications, etc. in hardware. Is this assumption correct?
I’m trying to understand how the different pipelines use these functional units. So I have a few questions:
How do functional units map to pipelines?
Do pipelines get assigned a disjoint set of functional units, or can multiple pipelines share the same units and potentially interfere with each other?
If, for example, I see in the Compute Workload Analysis section of Nsight Compute that the FMA pipeline is running close to its peak performance, does this mean other pipelines (e.g. ALU) might be performance-limited because their potential functional units are already saturated by the FMA pipeline? (A minimal example of the kind of kernel I mean is sketched at the end of this post.)
[1] suggests that FMA, FP64, ALU, etc. are fixed-latency math pipelines. In addition, [1] and [4] mention that some pipelines (e.g. FP16, Tensor/MMA, and FP64) share the same dispatch port, which could cause contention.
Is there any documentation about which pipelines share dispatch ports, or do you have to figure this out through benchmarking?
What is the design reasoning behind sharing a dispatch port between pipelines? Could this mean these pipelines are sharing a set of functional units, with the dispatcher coordinating their scheduling across the shared set of functional units?
Related to the question above, given that these math pipelines are fixed-latency, does that mean that once an instruction is dispatched, it is free from any source of interference until it has completed? In other words, can it no longer interfere with an instruction from any other pipeline that would be using the same functional unit?
Lastly, documents like [3] talk about a “datapath” in the context of the 2x FP32 Processing. Could someone please briefly explain how this relates to the pipelines?
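For reference, the FMA-near-peak scenario I mentioned above comes from profiling kernels roughly like the following toy sketch (my own example, not from any documentation) with ncu --section ComputeWorkloadAnalysis:

```
// Toy FMA-bound kernel (my own sketch). With enough resident warps, the
// Compute Workload Analysis section should report the FMA pipe close to peak.
__global__ void fma_saturate(float *out, float a, float b, int iters)
{
    float x = threadIdx.x * 0.001f;
    float y = threadIdx.x * 0.002f;
    for (int i = 0; i < iters; ++i) {
        // Each line compiles to an FFMA; two independent chains give the
        // scheduler some instruction-level parallelism.
        x = fmaf(a, x, b);
        y = fmaf(b, y, a);
    }
    // Store the results so the compiler cannot remove the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 1024 * 256 * sizeof(float));
    fma_saturate<<<1024, 256>>>(d_out, 1.0001f, 0.5f, 1 << 16);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

My question is essentially whether saturating the FMA pipe like this also eats into the hardware that the ALU pipe would otherwise use.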
A CUDA core is a pipelined math unit (aka datapath) that can execute an FP32 FMA.
The SM comprises numerous instruction pipelines (aka datapaths, aka execution units) for FP32, FP64, FP16, INT, Tensor (matrix multiply and accumulate), bit manipulation, logical operations, data movement, control flow (branching, barriers), and memory access.
On Volta through Hopper, each SM has 4 SM sub-partitions (called partitions in the latest whitepapers). Each sub-partition has a warp scheduler and dedicated instruction pipelines for FP32, FP16, INT, data conversion, the special function unit, uniform (Turing+), and Tensor operations. 100-class SMs may also have additional FP64 instruction pipelines.
The warp scheduler can dispatch instructions to the instruction pipelines in its sub-partition, or dispatch the instruction to the MIO (memory input/output) unit. The MIO unit is responsible for queuing and dispatching instructions to SM-shared instruction pipelines/execution units, including the LSU (load/store unit), TEX (texture unit), IDC (indexed constant cache), CBU (control and branch unit), … On GPUs with lighter SMs (10x/20x consumer parts) the FP64 unit is an SM-shared unit.
There is very limited microarchitecture detail on each SM, and the SM architecture can vary significantly. The Pipelines section of the Kernel Profiling Guide in the Nsight Compute documentation contains a list of the common instruction pipelines exposed by the profiler and the types of instructions supported by each pipeline.
On Volta through GA100 (not GA10x), the shared dispatch pipe in each SM sub-partition includes:
fp16ultra (FP16x2 HF*2)
tensor pipes (integer, floating point)
on 100-class parts, fp64lite for DFMA/DADD/DMUL/DSETP
On GA10x through GH100, FP16x2 is handled by the 2x FMA pipes (fmalite and fmaheavy).
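As an illustration (a sketch of mine, not an official statement): packed half2 math like the kernel below is the kind of code that generates the FP16x2 (HFMA2) instructions in question. On Volta - GA100 these would go to the fp16 pipe behind the shared dispatch port, on GA10x and later to the two FMA pipes.

```
#include <cuda_fp16.h>

// Sketch: packed FP16x2 multiply-add (requires sm_53 or newer). The HFMA2
// instructions this produces are the ones routed to the fp16 pipe on
// Volta - GA100, or to the fmalite/fmaheavy pipes on GA10x and later.
__global__ void hfma2_kernel(const __half2 *a, const __half2 *b, __half2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // c[i] = a[i] * b[i] + c[i], two half values per thread per instruction
        c[i] = __hfma2(a[i], b[i], c[i]);
    }
}
```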
By observation, the INT32 pipeline seems to comprise multiple functional units; GPU architectures differ as to whether these sub-units are unified with identical throughput or separated with differing throughput:
(1) three-input integer adder
(2) three-input integer multiply-add
(3) three-input funnel shifter
(4) three-input LUT-based logic op
(5) miscellaneous operations like byte permutation, population count, count-leading-zeros, and bit reversal
The above should be seen as a non-exhaustive list. For example, it is not clear to me where the LEA instruction (a shift-add type instruction) has its home.
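To make the list above a bit more concrete, here is a sketch of intrinsics that typically land on those sub-units. The actual SASS, and therefore the unit used, depends on compiler version and target architecture; verify with cuobjdump -sass.

```
// Sketch: intrinsics that usually map onto the INT32 sub-units listed above.
__global__ void int_units(unsigned *out, unsigned a, unsigned b, unsigned c)
{
    unsigned r = 0;
    r += a + b + c;                      // integer add (often folded into IADD3)
    r += a * b + c;                      // integer multiply-add (IMAD)
    r += __funnelshift_l(a, b, c & 31);  // funnel shift (SHF)
    r += (a & b) ^ c;                    // 3-input logic, typically fused into LOP3
    r += __byte_perm(a, b, c);           // byte permutation (PRMT)
    r += __popc(a);                      // population count (POPC)
    r += __clz(a);                       // count leading zeros (FLO-based)
    r += __brev(a);                      // bit reversal (BREV)
    out[threadIdx.x] = r;
}
```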
Thanks a lot for both of your responses, but what I’m still not sure about is whether these instruction pipelines are physically independent, in the sense that they do not share any underlying hardware to execute their respective instructions. Or could two pipelines interfere with each other, not at the level of the dispatcher (which they might share), but at the level of the functional units that underlie each of these pipelines? Or is this not a statement one can make in general because it depends on the architecture?
So for example, when you talk about the INT32 pipeline, I’m assuming you’re referring to all the instruction pipelines that can execute 32-bit integer operations, right? Do any of these pipelines share any of the functional units you listed, like the three-input integer adder or the three-input integer multiply-add unit, or does each unit belong to a single pipeline?
Sorry, I’m not trying to nitpick here; I’m just trying to see whether I have the right picture of how instructions are executed at the sub-partition level and in what ways they may interfere.
NVIDIA is very tight-lipped about any implementation details and has pursued this approach consistently and with conviction. You can find little tidbits sprinkled throughout the official documentation and some presentations. You can infer a bit more by studying the throughput information in the programming guide, by devising microbenchmarks, by reverse engineering instruction sets, etc.
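As a sketch of the microbenchmark route (my own example; the numbers and their interpretation are architecture-specific): time a loop that issues only one instruction type, then a loop that interleaves two types. If the mixed loop takes roughly as long as the slower single-type loop rather than the sum of both, the two instruction streams are being issued to different pipes; if it takes close to the sum, they are contending for the same pipe or a shared dispatch port. You need enough resident warps for the measurement to be issue-limited rather than latency-limited.

```
#include <cstdio>

// Microbenchmark sketch: interleave FP32 FMA (FMA pipe) with integer adds
// (typically the ALU pipe on Volta+). Compare against variants with only one
// of the two lines in the loop body. This only illustrates the method.
__global__ void mix_bench(float *fout, int *iout, int iters, long long *cycles)
{
    float f = threadIdx.x * 0.001f;
    int   k = threadIdx.x;
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        f = fmaf(f, 1.0001f, 0.5f);  // FFMA
        k = k + i + 7;               // integer add (IADD3)
    }
    long long stop = clock64();
    if (threadIdx.x == 0) cycles[blockIdx.x] = stop - start;
    fout[threadIdx.x] = f;           // keep both results live
    iout[threadIdx.x] = k;
}

int main()
{
    float *df; int *di; long long *dc;
    cudaMalloc(&df, 512 * sizeof(float));
    cudaMalloc(&di, 512 * sizeof(int));
    cudaMalloc(&dc, sizeof(long long));
    // One block of 512 threads = 16 warps on one SM (4 per scheduler).
    mix_bench<<<1, 512>>>(df, di, 1 << 20, dc);
    long long c;
    cudaMemcpy(&c, dc, sizeof(c), cudaMemcpyDeviceToHost);
    printf("cycles: %lld\n", c);
    cudaFree(df); cudaFree(di); cudaFree(dc);
    return 0;
}
```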
The point about NVIDIA disclosing very little was clear to me. The question remains: which functional unit handles LEA, the IMAD unit?
Without knowing the full internals (which can also differ between architectures and models):
All the older GPU generations before Volta had more complicated scheduling and dispatching. Each scheduler could issue up to 2 instructions per cycle (except for compute capability 2.0), and Fermi had 2 schedulers per SM while Kepler had 4.
Volta (and all the architectures afterwards, which are basically still Volta) simplified that a lot: each of the 4 partitions has one scheduler, which can issue 1 new instruction per cycle.
The execution pipelines are high-level, and each can accept a certain number of lanes per cycle, e.g. at 16 lanes/cycle it takes 2 cycles to feed a full 32-lane warp instruction into the pipeline. The latency of course is higher (pipelined architecture).
Sometimes the pipelines split further into sub-pipelines, but most often they do not. Those are implementation details, as is whether hardware logic is shared between different instructions. Since the high-level pipeline is occupied either way, this has no effect on the performance of the program.
It matters only in cases where different high-level pipelines can execute similar operations, e.g. fmaheavy vs. fmalite vs. Tensor Cores.
IMUL and IMAD are done by the FMA pipeline, according to the Kernel Profiling Guide’s description of that pipe:
“Fused Multiply Add/Accumulate. The FMA pipeline processes most FP32 arithmetic (FADD, FMUL, FMAD). It also performs integer multiplication operations (IMUL, IMAD), as well as integer dot products.”
LEA is done by the integer pipeline and not the FMA pipeline; perhaps the adder and shifter (1 and 3) are fused and can be operated separately or as a unit?
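One way to check is to look at the SASS directly (a sketch; file and kernel names are just placeholders): compile with, e.g., nvcc -arch=sm_86 -cubin lea_probe.cu and disassemble with cuobjdump -sass lea_probe.cubin. The address computation often shows up as LEA (sometimes IMAD.WIDE, depending on architecture and compiler version), while the arithmetic shows up as IMAD, matching the pipe attribution discussed above.

```
// Sketch for inspecting which SASS instructions the compiler emits.
__global__ void lea_probe(int *out, const int *in, int a, int b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Address computation (base + 4*i): often emitted as LEA / LEA.HI.X,
    // sometimes as IMAD.WIDE, depending on architecture and compiler version.
    int v = in[i];
    // Integer multiply-add on the data: typically emitted as IMAD.
    out[i] = v * a + b;
}
```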
There is a published NVIDIA patent about the ALU, which could give general clues about the workings of the CUDA ALU (I have not read it).
But even if the patent describes something, it does not mean that it is used in past, current, or future NVIDIA hardware in exactly that way.