Separate CUDA Core pipeline for FP16 and FP32?

I am profiling A100 on Nsight Compute.

My previous understanding, based on the A100 SM diagram, was that fp16 operations on CUDA Cores are packed into half2 and mapped to the 64 FP32 units (thus double the FP32 throughput).

However, there is a conflict in the ‘Compute Workload Analysis’ section that makes me doubt this: it shows entries for both FMA and FP16 under Compute Workload Analysis > Pipe Utilization (% of peak instructions executed), and the details note that ‘On GA100, fp16 pipeline performs paired fp16 operation’.

I wonder if this fp16 pipeline physically makes use of the FP32 units, or if it is a separate unit that is omitted from the A100 SM diagram (Figure 7 of the A100 white paper, nvidia-ampere-architecture-whitepaper.pdf).
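
Concretely, the kind of packed-fp16 arithmetic I have in mind is something like this (a minimal sketch; the kernel name is just illustrative):

```cuda
#include <cuda_fp16.h>

// Two fp16 values packed into one half2; a single HFMA2 instruction then
// performs two fp16 FMAs per thread. My assumption was that this HFMA2
// executes on the same 64 FP32 units shown in the SM diagram.
__global__ void packed_fp16_fma(const half2 *a, const half2 *b, half2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma2(a[i], b[i], c[i]);  // c = a * b + c on both packed halves
}
```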

For Volta, Turing, and GA100, the fp16 pipe and the FMA pipe are independent. The fp16 pipe shares a pipe with the Tensor Cores. This can be demonstrated by writing a kernel that alternates between issuing fp16 and fp32 instructions:

  • The fp32 instruction rate is 0.5 instructions/cycle per SM sub-partition.
  • The fp16 instruction rate is 0.5 instructions/cycle per SM sub-partition.

If the two pipes are shared, then the maximum SM sub-partition (SMSP) IPC is 0.5.
If the two pipes are not shared, then the maximum theoretical SM sub-partition (SMSP) IPC is 1.0.
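
A minimal sketch of such a kernel (illustrative names; launch enough warps per SM sub-partition so that the issue rate, not dependent-instruction latency, is the limiter):

```cuda
#include <cuda_fp16.h>

// Alternates an fp32 FFMA and a packed-fp16 HFMA2. If the fp16 and FMA pipes
// are independent, Nsight Compute should show the SMSP issue rate approaching
// 1.0 IPC with both pipes near their 0.5 inst/cycle peak; if they share a
// pipe, the combined issue rate caps near 0.5 IPC.
__global__ void interleave_fp32_fp16(float *fout, half2 *hout, int iters)
{
    float f = (float)threadIdx.x;
    half2 h = __float2half2_rn(1.0f);
    const float fa = 1.0001f, fb = 0.0001f;
    const half2 ha = __float2half2_rn(1.0001f), hb = __float2half2_rn(0.0001f);

    #pragma unroll 16
    for (int i = 0; i < iters; ++i) {
        f = fmaf(f, fa, fb);      // FFMA  -> FMA pipe
        h = __hfma2(h, ha, hb);   // HFMA2 -> fp16 pipe
    }

    // Keep results live so the loop is not optimized away.
    fout[threadIdx.x] = f;
    hout[threadIdx.x] = h;
}
```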


Thank you for the clear answer.

Applying the same logic, and given that fp16 and the Tensor Cores share a pipe on GA100:

Should I expect a kernel that alternates between fp16 and Tensor Core instructions to have the average of the two instructions’ IPC as its maximum theoretical SMSP IPC?

On GA100 the FP16 pipe, Tensor (*MMA) pipes, and FP64 pipe share a dispatch port. Interleaving instructions of these types will limit the throughput of each instruction type.
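
For reference, an interleaving test along those lines could look like this (a sketch, assuming sm_70 or newer and one full warp per wmma tile; names are illustrative):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Interleaves an HMMA (Tensor Core, via wmma) with a packed-fp16 HFMA2.
// On GA100, both compete for the shared dispatch port, so interleaving
// them should limit the throughput of each, visible in Nsight Compute's
// pipe-utilization breakdown.
__global__ void interleave_hmma_hfma2(const half *a, const half *b,
                                      float *c, half2 *out, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0.0f);

    half2 x = __float2half2_rn(1.0f);
    const half2 y = __float2half2_rn(1.0001f);

    for (int i = 0; i < iters; ++i) {
        wmma::mma_sync(fc, fa, fb, fc);  // HMMA  -> Tensor pipe
        x = __hfma2(x, y, x);            // HFMA2 -> fp16 pipe
    }

    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    out[threadIdx.x] = x;
}
```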

Thank you for the answer.
With this in mind, it seems the A100 SM diagram is somewhat incomplete, as it does not show the physical units for the FP16 pipe.

Figure 5 of https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/, the same diagram as in the A100 white paper.

Having run the experiment described above:

We verified that on A100, the FP16 and FP32 pipes are independent, whereas the FP16 and Tensor pipes are dependent (shared pipe).

A similar experiment on an A6000 (GA102) shows that the FP16 and Tensor pipes are independent, while the FP16 and FP32 pipes are not.

Could you verify:
i) whether our observations on the A6000 are correct, and
ii) which pipes among FP16, FP32, and Tensor are independent on H100?

We would greatly appreciate some insight into H100 before we actually get the hardware to run the experiments ourselves.

On GA100 (SM8.0)

  • Shared pipe handles Tensor, FP16, and FP64 operations.
  • FMA pipe handles IMAD, IDP, and FP32 operations.

On GA10x (SM8.6)

  • First chip family with 2x FP32 throughput.
  • Shared pipe handles Tensor operations.
  • FMAheavy pipe handles IMAD, IDP, and FP32 operations.
  • FMAlite pipe handles FP32 operations.
  • FP16x2 operations are dual-issued to both the FMAheavy and FMAlite pipes.

On GH100 (SM9.0)

  • Shared pipe handles Tensor and FP64 operations.
  • Same as GA10x for the FMA pipes and FP16x2.

Thank you for the response.

By referring to each compute unit as a ‘pipe’, is there an underlying premise that multiple instructions can be in flight, one per stage, at the same time?

E.g., the first HFMA in FP16 pipe stage 3, the second HFMA in stage 2, and the third HFMA in stage 1.

This would effectively increase the instruction throughput.

The “CUDA Core”, “Tensor Core”, FMA, ALU, XU, LSU/TEX pipes are instruction pipelines.

Instruction throughputs, by datatype/operation, are documented in the CUDA C++ Programming Guide section on Arithmetic Instructions (nvidia.com). The pipeline depth and dependent instruction latency are not well documented. High-level guidance is given in the CUDA C++ Programming Guide section on Maximize Utilization (nvidia.com).

Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles.
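
As an illustration of what the pipelining means in practice (a sketch with illustrative names): because the FMA pipe is pipelined, giving a warp several independent dependency chains lets multiple FMAs be in flight at once instead of one new FMA every ~4 cycles.

```cuda
// With a single dependent chain, a warp can issue a new FFMA only about
// every 4 cycles (the dependent-instruction latency). With four independent
// chains, up to four FFMAs are in flight in the pipeline at once, so the
// warp can issue on many more cycles.
__global__ void ilp_chains(float *out, int iters)
{
    float a = (float)threadIdx.x, b = a + 1.0f, c = a + 2.0f, d = a + 3.0f;
    const float k = 1.0001f, m = 0.0001f;

    #pragma unroll 8
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, k, m);   // the four chains are independent of each other,
        b = fmaf(b, k, m);   // so their FMAs overlap in the FMA pipe
        c = fmaf(c, k, m);
        d = fmaf(d, k, m);
    }

    out[threadIdx.x] = a + b + c + d;  // keep all chains live
}
```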

You may find some useful information in this paper if you’ve not already seen it.

Thank you for the suggestions.

I have, but the paper primarily focuses on benchmarking 4th-gen Tensor Cores; not much attention is given to CUDA Cores.
