Hi all,
I know there are already a couple of topics on this issue, but I’m still trying to get a clear picture of CUDA cores, functional units, and pipelines and how they relate to each other. I’m trying to identify ways to further improve the utilization of my GPUs.
I’m aware that some of the details below are architecture-dependent, but I want to understand the general concept, which I believe should remain fairly consistent across architectures.
According to [1], “CUDA cores” is a marketing term that helps to convey a magnitude of performance. My understanding is that the INT32, FP32, and FP64 cores often depicted in the whitepapers of the different architectures (e.g., Figure 7 in [2]) don’t actually exist as physical units on the GPU. Instead, these “CUDA cores” are backed by physical “functional units” (whose details NVIDIA does not disclose) that actually execute additions, multiplications, etc. in hardware. Is this assumption correct?
I’m trying to understand how the different pipelines use these functional units. So I have a few questions:
- How do functional units map to pipelines?
- Do pipelines get assigned a disjoint set of functional units, or can multiple pipelines share the same units and potentially interfere with each other?
- If, for example, the Compute Workload Analysis section of Nsight Compute shows that the FMA pipeline is running close to its peak performance, does this mean other pipelines (e.g., ALU) might be performance-limited because the functional units they would use are already saturated by the FMA pipeline?
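To make that last question concrete, here is roughly the kind of experiment I had in mind (this is entirely my own sketch; kernel names, loop counts, and launch configuration are arbitrary). The idea would be to profile both kernels with Nsight Compute and compare the FMA and ALU pipe utilization reported under Compute Workload Analysis, to see whether adding integer/logic work costs FMA throughput:

```cuda
// My own microbenchmark sketch, not a rigorous test.
#include <cuda_runtime.h>

__global__ void fma_only(float *out, int iters) {
    float a = threadIdx.x * 0.5f, b = 1.0001f, c = 0.0001f;
    for (int i = 0; i < iters; ++i)
        a = fmaf(a, b, c);                       // keeps the FMA pipe busy
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

__global__ void fma_plus_alu(float *out, int *iout, int iters) {
    float a = threadIdx.x * 0.5f, b = 1.0001f, c = 0.0001f;
    int   x = threadIdx.x, y = 7;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, b, c);                       // FMA pipe
        x = (x ^ 0x5DEECE66) + y;                // logic/add ops, which I believe map to the ALU pipe
    }
    out[blockIdx.x * blockDim.x + threadIdx.x]  = a;
    iout[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;
    float *d_out;  int *d_iout;
    cudaMalloc(&d_out,  blocks * threads * sizeof(float));
    cudaMalloc(&d_iout, blocks * threads * sizeof(int));
    fma_only<<<blocks, threads>>>(d_out, iters);
    fma_plus_alu<<<blocks, threads>>>(d_out, d_iout, iters);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    cudaFree(d_iout);
    return 0;
}
```

I would then run something like `ncu --set full ./fma_test` (binary name is just a placeholder) and compare the pipe utilization charts for the two launches.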
[1] suggests that FMA, FP64, ALU, etc. are fixed-latency math pipelines. In addition, [1] and [4] mention that some pipelines (e.g., FP16, Tensor/MMA, and FP64) share the same dispatch port, which could cause contention.
- Is there any documentation about which pipelines share dispatch ports, or do you have to figure this out through benchmarking (see the sketch after these questions)?
- What is the design reasoning behind sharing a dispatch port between pipelines? Could it mean that these pipelines share a set of functional units, with the dispatcher coordinating their scheduling across that shared set?
- Related to the question above, given that these math pipelines are fixed-latency, does that mean that once an instruction is dispatched, it is free from any source of interference until it has completed? In other words, can it no longer interfere with an instruction from any other pipeline that would use the same functional unit?
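For the benchmarking approach, this is the kind of contention probe I was thinking of (again just my own sketch; names, iteration counts, and launch configuration are arbitrary, and it needs a GPU with native FP16, i.e. sm_53+). If the mixed kernel takes noticeably longer than the slower of the two pure kernels, I would read that as the two pipelines contending for a shared resource such as a dispatch port; if it roughly matches the slower one, they seem to overlap well:

```cuda
// My own sketch of an FP64/FP16 contention probe.
#include <cuda_fp16.h>
#include <cstdio>

__global__ void fp64_only(double *out, int iters) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    double d = gid;
    for (int i = 0; i < iters; ++i)
        d = fma(d, 1.0000001, 0.0000001);        // should compile to DFMA (FP64 pipe)
    out[gid] = d;
}

__global__ void fp16_only(half2 *out, int iters) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    half2 h = __floats2half2_rn(1.0f, 2.0f);
    half2 b = __floats2half2_rn(1.001f, 0.001f);
    for (int i = 0; i < iters; ++i)
        h = __hfma2(h, b, b);                    // should compile to HFMA2 (FP16 pipe)
    out[gid] = h;
}

__global__ void mixed(double *dout, half2 *hout, int iters) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    double d = gid;
    half2 h = __floats2half2_rn(1.0f, 2.0f);
    half2 b = __floats2half2_rn(1.001f, 0.001f);
    for (int i = 0; i < iters; ++i) {
        d = fma(d, 1.0000001, 0.0000001);        // FP64 pipe
        h = __hfma2(h, b, b);                    // FP16 pipe
    }
    dout[gid] = d;
    hout[gid] = h;
}

static float time_ms(cudaEvent_t t0, cudaEvent_t t1) {
    float ms = 0.0f;
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    return ms;
}

int main() {
    const int blocks = 512, threads = 256, iters = 1 << 16;
    double *d_d; half2 *d_h;
    cudaMalloc(&d_d, blocks * threads * sizeof(double));
    cudaMalloc(&d_h, blocks * threads * sizeof(half2));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0); fp64_only<<<blocks, threads>>>(d_d, iters);  cudaEventRecord(t1);
    printf("fp64 only: %.3f ms\n", time_ms(t0, t1));
    cudaEventRecord(t0); fp16_only<<<blocks, threads>>>(d_h, iters);  cudaEventRecord(t1);
    printf("fp16 only: %.3f ms\n", time_ms(t0, t1));
    cudaEventRecord(t0); mixed<<<blocks, threads>>>(d_d, d_h, iters); cudaEventRecord(t1);
    printf("mixed:     %.3f ms\n", time_ms(t0, t1));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d_d); cudaFree(d_h);
    return 0;
}
```

Is this a reasonable way to detect shared dispatch ports, or is there a better methodology?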
Lastly, documents like [3] talk about a “datapath” in the context of 2x FP32 processing. Could someone please briefly explain how this relates to the pipelines?
I appreciate your help!
[2] NVIDIA H100 Tensor Core GPU Architecture Overview
[3] NVIDIA Ampere GA102 GPU Architecture Whitepaper: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf