My previous understanding, based on the A100 SM diagram, was that FP16 operations on CUDA cores are packed into half2 and mapped to the 64 FP32 units, thus doubling the FP32 throughput.

However, there is a conflict in Nsight Compute's 'Compute Workload Analysis' that makes me doubt this. It shows entries for both FMA and FP16 under Compute Workload Analysis > Pipe Utilization (% of peak instructions executed), and the details note that 'On GA100, fp16 pipeline performs paired fp16 operation'.

I wonder whether this FP16 pipeline physically uses the FP32 units, or whether it is a separate unit that was omitted from the A100 SM diagram (Figure 7 of the A100 whitepaper, nvidia-ampere-architecture-whitepaper.pdf).

For Volta, Turing, and GA100, the fp16 pipe and the fma pipe are independent. The fp16 pipe shares a dispatch port with the tensor cores. This can be verified by writing a kernel that alternates between issuing fp16 and fp32 instructions.
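A minimal sketch of such a probe kernel (the kernel name, parameters, and unroll factor are my own illustration, not from the original post). Each thread interleaves dependent half2 and float FMAs; profiled with Nsight Compute, both the fp16 and fma pipes should show activity, and the achieved SMSP IPC distinguishes the shared-pipe case from the independent-pipe case:

```cuda
#include <cuda_fp16.h>

// Alternate HFMA2 (fp16 pipe) and FFMA (fma pipe) instructions so the
// scheduler can interleave issues to both pipes if they are independent.
__global__ void alternate_fp16_fp32(half2 *h_out, float *f_out,
                                    half2 h, float f, int iters)
{
    half2 ha = h, hb = h;
    float fa = f, fb = f;
    #pragma unroll 16
    for (int i = 0; i < iters; ++i) {
        ha = __hfma2(ha, hb, ha);  // HFMA2 on the fp16 pipe
        fa = fmaf(fa, fb, fa);     // FFMA on the fma (fp32) pipe
    }
    // Write the results so the compiler cannot eliminate the loop.
    h_out[threadIdx.x] = ha;
    f_out[threadIdx.x] = fa;
}
```

Launching this with enough warps to saturate each SM sub-partition and reading the smsp__inst_executed rate in Nsight Compute shows whether the combined issue rate exceeds that of either pipe alone.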

The fp32 instruction rate is 0.5 instructions/cycle per SM sub-partition.

The fp16 instruction rate is 0.5 instructions/cycle per SM sub-partition.

If the two pipes are shared, then the maximum SM sub-partition (SMSP) IPC is 0.5.
If the two pipes are not shared, then the maximum theoretical SM sub-partition (SMSP) IPC is 1.0.
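To make the two cases concrete (a worked example using the rates above, for an alternating stream of N fp16 and N fp32 instructions):

```latex
\text{shared port: } t = \frac{2N}{0.5} = 4N \text{ cycles}
  \;\Rightarrow\; \mathrm{IPC} = \frac{2N}{4N} = 0.5
\qquad
\text{independent pipes: } t = \max\!\left(\frac{N}{0.5}, \frac{N}{0.5}\right) = 2N \text{ cycles}
  \;\Rightarrow\; \mathrm{IPC} = \frac{2N}{2N} = 1.0
```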

Applying the same logic, and given that the fp16 and tensor core pipes share a dispatch port on GA100:

Should I expect a kernel that alternates between fp16 and tensor core instructions to have, as its maximum theoretical SMSP IPC, the average of the two instruction types' IPCs?

On GA100 the FP16 pipe, the Tensor (*MMA) pipes, and the FP64 pipe share a dispatch port. Interleaving instructions of these types will limit the throughput of each instruction type.
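A sketch of the corresponding experiment (the kernel name and tile sizes are my own illustration; the nvcuda::wmma interface is the standard CUDA C++ tensor core API). Interleaving HFMA2 with mma_sync should reveal the shared dispatch port, since the combined issue rate cannot exceed that of either pipe alone:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Interleave fp16-pipe instructions with tensor core MMA operations.
// On GA100 these pipes share a dispatch port, so issuing one delays
// the other; profile with Nsight Compute to observe the contention.
__global__ void alternate_fp16_mma(const half *a, const half *b,
                                   float *c, half2 h, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0.0f);

    half2 x = h;
    for (int i = 0; i < iters; ++i) {
        x = __hfma2(x, x, x);            // fp16 pipe
        wmma::mma_sync(fc, fa, fb, fc);  // tensor core (*MMA) pipe
    }

    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    if (__low2float(x) < 0.0f) c[0] = 1.0f;  // keep x live
}
```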

Thank you for the answer.
With this in mind, it seems the A100 SM diagram is somewhat incomplete, since it does not show the physical units for the FP16 pipe.

Since each compute unit is referred to as a 'pipe': is there an underlying premise that multiple instructions can be in flight, one per pipeline stage, at the same time?

E.g. the first HFMA in FP16 pipe stage 3, the second HFMA in stage 2, and the third HFMA in stage 1.

This would effectively increase the instruction throughput.

Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles.
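That latency applies per dependent chain, not to the issue rate: the pipe can accept a new independent instruction while earlier ones are still in later stages. A sketch of how independent chains exploit this (all names are my own illustration):

```cuda
// With ~4-cycle dependent-issue latency, a single chain of dependent
// FMAs issues at best one instruction every 4 cycles per thread, while
// four independent chains can keep successive pipeline stages occupied.
__global__ void ilp_demo(float *out, float a, float b, int iters)
{
    float x0 = a, x1 = a + 1.0f, x2 = a + 2.0f, x3 = a + 3.0f;
    for (int i = 0; i < iters; ++i) {
        // Four independent FMA chains: each instruction depends only on
        // its own chain, so consecutive issues overlap in the pipeline.
        x0 = fmaf(x0, b, a);
        x1 = fmaf(x1, b, a);
        x2 = fmaf(x2, b, a);
        x3 = fmaf(x3, b, a);
    }
    out[threadIdx.x] = x0 + x1 + x2 + x3;  // keep all chains live
}
```

In practice the same effect is usually achieved by running enough warps per SM sub-partition that the scheduler always has an independent instruction to issue.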