I am profiling an A100 with Nsight Compute. My previous understanding, based on the A100 SM diagram, was that FP16 operations on CUDA cores are packed into half2 and mapped to the 64 FP32 units (thus doubling the FP32 throughput). However, there is a conflict in the ‘Compute Workload Analysis’ that ma…

For Volta, Turing, and GA100 the FP16 pipe and the FMA pipe are independent. The FP16 pipe shares a pipe with the tensor cores. This can be demonstrated by writing a kernel that alternates between issuing FP16 and FP32 instructions. The fp32 instruction rate is 0.5 instructions/cycle per SM sub-pa…
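A minimal sketch of such a test kernel, not the code Greg used: it issues FP32 FMAs only, packed FP16 (half2) FMAs only, or both interleaved, and times each variant with CUDA events. The names `mix_kernel` and `time_ms`, the launch dimensions, and the constants are assumptions chosen for illustration. Each thread runs one dependent chain per data type, so the measurement relies on having many resident warps to hide latency. Wall-clock time is only a rough signal; the per-pipe utilization reported by Nsight Compute is the more direct evidence.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// MODE 0: FP32 FMAs only. MODE 1: half2 FMAs only. MODE 2: both, interleaved.
template <int MODE>
__global__ void mix_kernel(float* out_f, __half2* out_h, int iters, float seed)
{
    // Runtime-dependent start values so the compiler cannot fold the loop away.
    float a = seed + 0.001f * threadIdx.x;
    const float b = 0.999f, c = 0.001f;
    __half2 x = __float2half2_rn(seed);
    const __half2 y = __float2half2_rn(0.5f);
    const __half2 z = __float2half2_rn(0.5f);

    #pragma unroll 8
    for (int i = 0; i < iters; ++i) {
        if (MODE != 1) a = fmaf(a, b, c);     // FP32 fused multiply-add
        if (MODE != 0) x = __hfma2(x, y, z);  // packed FP16x2 fused multiply-add
    }

    // Store both results so neither chain is dead code.
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out_f[idx] = a;
    out_h[idx] = x;
}

// Hypothetical host helper: returns elapsed milliseconds, or -1.0f on a CUDA error.
template <int MODE>
float time_ms(int iters)
{
    const int blocks = 1024, threads = 256;  // assumed sizes, enough to fill an A100
    float* out_f = nullptr;
    __half2* out_h = nullptr;
    cudaMalloc(&out_f, blocks * threads * sizeof(float));
    cudaMalloc(&out_h, blocks * threads * sizeof(__half2));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    mix_kernel<MODE><<<blocks, threads>>>(out_f, out_h, 16, 1.0f);  // warm-up launch
    cudaEventRecord(start);
    mix_kernel<MODE><<<blocks, threads>>>(out_f, out_h, iters, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out_f);
    cudaFree(out_h);
    return ok ? ms : -1.0f;
}
```

Calling `time_ms<0>(n)`, `time_ms<1>(n)`, and `time_ms<2>(n)` with the same `n` gives three timings to compare. If the two instruction types go to independent pipes, the interleaved time would be expected to land well below the sum of the other two; if they contend for the same pipe, it should approach the sum. Instruction issue limits can blur this, so treat the result as indicative only.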

Separate CUDA Core pipeline for FP16 and FP32?


Greg August 6, 2024, 7:55pm 7

On GA100 (SM8.0)

Shared pipe handles Tensor, FP16, and FP64 operations.
FMA pipe handles IMAD, IDP, and FP32 operations.

On GA10x (SM8.6)

First chip family with 2x FP32
Shared pipe handles Tensor operations.
FMAheavy pipe handles IMAD, IDP, and FP32 operations.
FMAlite pipe handles FP32 operations.
FP16x2 operations are dual-issued to both the FMAheavy and FMAlite pipes.

On GH100 (SM9.0)

Shared pipe handles Tensor and FP64 operations.
Same as GA10x for FMA pipes and FP16x2.
