My previous understanding, based on the A100 SM diagram, was that FP16 operations on CUDA cores are packed into half2 and mapped to the 64 FP32 units, thus doubling the FP32 throughput.

However, there is a conflict in Nsight Compute's 'Compute Workload Analysis' that makes me doubt this. It shows entries for both FMA and FP16 under Compute Workload Analysis > Pipe Utilization (% of peak instructions executed), and the details note that 'On GA100, fp16 pipeline performs paired fp16 operation'.

I wonder whether this FP16 pipeline physically uses the FP32 units, or whether it is a separate unit that was omitted from the A100 SM diagram (Figure 7 of the A100 whitepaper, nvidia-ampere-architecture-whitepaper.pdf).

For Volta, Turing, and GA100, the fp16 pipe and the fma pipe are independent. The fp16 pipe shares a dispatch port with the tensor cores. This can be verified by writing a kernel that alternates between issuing fp16 and fp32 instructions.
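A minimal sketch of such a probe kernel (the kernel name, parameters, and unroll factor are my own illustration, not from the original post). Each thread interleaves dependent half2 and float FMAs; profiled with Nsight Compute, both the fp16 and fma pipes should show activity, and the achieved SMSP IPC distinguishes the shared-pipe case from the independent-pipe case:

```cuda
#include <cuda_fp16.h>

// Alternate HFMA2 (fp16 pipe) and FFMA (fma pipe) instructions so the
// scheduler can interleave issues to both pipes if they are independent.
__global__ void alternate_fp16_fp32(half2 *h_out, float *f_out,
                                    half2 h, float f, int iters)
{
    half2 ha = h, hb = h;
    float fa = f, fb = f;
    #pragma unroll 16
    for (int i = 0; i < iters; ++i) {
        ha = __hfma2(ha, hb, ha);  // HFMA2 on the fp16 pipe
        fa = fmaf(fa, fb, fa);     // FFMA on the fma (fp32) pipe
    }
    // Write the results so the compiler cannot eliminate the loop.
    h_out[threadIdx.x] = ha;
    f_out[threadIdx.x] = fa;
}
```

Launching this with enough warps to saturate each SM sub-partition and reading the smsp__inst_executed rate in Nsight Compute shows whether the combined issue rate exceeds that of either pipe alone.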

The fp32 instruction rate is 0.5 instructions/cycle per SM sub-partition.

The fp16 instruction rate is 0.5 instructions/cycle per SM sub-partition.

If the two pipes are shared, then the maximum SM sub-partition (SMSP) IPC is 0.5.
If the two pipes are not shared, then the maximum theoretical SM sub-partition (SMSP) IPC is 1.0.
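To make the two cases concrete (a worked example using the rates above, for an alternating stream of N fp16 and N fp32 instructions):

```latex
\text{shared port: } t = \frac{2N}{0.5} = 4N \text{ cycles}
  \;\Rightarrow\; \mathrm{IPC} = \frac{2N}{4N} = 0.5
\qquad
\text{independent pipes: } t = \max\!\left(\frac{N}{0.5}, \frac{N}{0.5}\right) = 2N \text{ cycles}
  \;\Rightarrow\; \mathrm{IPC} = \frac{2N}{2N} = 1.0
```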

Applying the same logic, and given that the fp16 and tensor core pipes share a dispatch port on GA100:

Should I expect a kernel that alternates between fp16 and tensor core instructions to have, as its maximum theoretical SMSP IPC, the average of the two instruction types' IPCs?

On GA100 the FP16 pipe, the Tensor (*MMA) pipes, and the FP64 pipe share a dispatch port. Interleaving instructions of these types will limit the throughput of each instruction type.
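A sketch of the corresponding experiment (the kernel name and tile sizes are my own illustration; the nvcuda::wmma interface is the standard CUDA C++ tensor core API). Interleaving HFMA2 with mma_sync should reveal the shared dispatch port, since the combined issue rate cannot exceed that of either pipe alone:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Interleave fp16-pipe instructions with tensor core MMA operations.
// On GA100 these pipes share a dispatch port, so issuing one delays
// the other; profile with Nsight Compute to observe the contention.
__global__ void alternate_fp16_mma(const half *a, const half *b,
                                   float *c, half2 h, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::fill_fragment(fc, 0.0f);

    half2 x = h;
    for (int i = 0; i < iters; ++i) {
        x = __hfma2(x, x, x);            // fp16 pipe
        wmma::mma_sync(fc, fa, fb, fc);  // tensor core (*MMA) pipe
    }

    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    if (__low2float(x) < 0.0f) c[0] = 1.0f;  // keep x live
}
```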

Thank you for the answer.
With this in mind, it seems the A100 SM diagram is somewhat incomplete, since it does not show the physical units for the FP16 pipe.

Since each compute unit is referred to as a 'pipe': is there an underlying premise that multiple instructions can be in flight, one per pipeline stage, at the same time?

E.g. the first HFMA in FP16 pipe stage 3, the second HFMA in stage 2, and the third HFMA in stage 1.

This would effectively increase the instruction throughput.

Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles.
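That latency applies per dependent chain, not to the issue rate: the pipe can accept a new independent instruction while earlier ones are still in later stages. A sketch of how independent chains exploit this (all names are my own illustration):

```cuda
// With ~4-cycle dependent-issue latency, a single chain of dependent
// FMAs issues at best one instruction every 4 cycles per thread, while
// four independent chains can keep successive pipeline stages occupied.
__global__ void ilp_demo(float *out, float a, float b, int iters)
{
    float x0 = a, x1 = a + 1.0f, x2 = a + 2.0f, x3 = a + 3.0f;
    for (int i = 0; i < iters; ++i) {
        // Four independent FMA chains: each instruction depends only on
        // its own chain, so consecutive issues overlap in the pipeline.
        x0 = fmaf(x0, b, a);
        x1 = fmaf(x1, b, a);
        x2 = fmaf(x2, b, a);
        x3 = fmaf(x3, b, a);
    }
    out[threadIdx.x] = x0 + x1 + x2 + x3;  // keep all chains live
}
```

In practice the same effect is usually achieved by running enough warps per SM sub-partition that the scheduler always has an independent instruction to issue.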