How FP32 and FP16 units are implemented in GP100 GPUs

Varun1312 · March 27, 2017, 8:27am

The GP100 GPU, based on the Pascal architecture, has a peak performance of 10.6 TFLOPS for FP32 and 21.2 TFLOPS for FP16. The representations of FP16 and FP32 numbers are quite different, i.e. the same number has a different bit pattern in FP32 and FP16 (unlike integers, where a 16-bit integer has the same bit pattern in its 32-bit representation apart from the leading zeros).

How are the floating-point units in GP100 implemented so that a speedup of nearly 2x is achieved by moving from FP32 to FP16?
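The bit-pattern point in the question above can be checked on any host machine. The following is a minimal sketch using Python's standard `struct` module, whose `e` and `f` format codes encode IEEE-754 binary16 and binary32; the helper name `bits` is made up for this illustration and has nothing to do with CUDA.

```python
import struct

def bits(fmt: str, value) -> str:
    """Hex string of `value` encoded big-endian with struct format `fmt`."""
    return struct.pack(">" + fmt, value).hex()

# Integers: widening a non-negative 16-bit value only adds leading zeros.
print(bits("h", 5), bits("i", 5))

# Floats: FP16 has a 5-bit exponent (bias 15) and a 10-bit mantissa,
# FP32 has an 8-bit exponent (bias 127) and a 23-bit mantissa,
# so the same value is encoded with a different bit pattern.
for x in (1.0, 0.5, -2.0):
    print(x, bits("e", x), bits("f", x))
```

For example, 1.0 encodes as `3c00` in FP16 but `3f800000` in FP32, so an FP32 datapath cannot simply treat an FP16 operand as a zero-padded FP32 one.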

Robert_Crovella · March 27, 2017, 10:51am

They are represented as the corresponding IEEE-754 datatype indicates.

CUDA refers to this as the half datatype.

There is also a half2 vector type, which is needed to get maximum performance from some operations.

Refer to the cuda_fp16.h header file.

https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/
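To make the half2 idea concrete: two FP16 values occupy the same 32 bits as one FP32 value. The sketch below emulates that packing on the host in Python. It is only an illustration of the storage layout, not CUDA code and not a description of the hardware; the function names and the choice of which element sits in the low 16 bits are assumptions made for this example.

```python
import struct

def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    return int.from_bytes(struct.pack("<2e", lo, hi), "little")

def unpack_half2(word: int) -> tuple:
    """Recover the two FP16 values from a 32-bit word."""
    return struct.unpack("<2e", word.to_bytes(4, "little"))

word = pack_half2(1.0, -2.0)
print(hex(word))            # both halves live in a single 32-bit word
print(unpack_half2(word))   # the original pair comes back unchanged
```

An instruction that operates on such a packed word handles two FP16 values at once, which is why the packed type matters for peak FP16 throughput.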

Varun1312 · March 27, 2017, 11:28am

Thanks for the quick response.

My question was from an architecture perspective.

If GP100 supports half-precision units, then there is a piece of hardware that can “decode” the half-precision format and do the computation. Similarly, for single-precision, there should exist a piece of hardware that decodes IEEE-754 single precision format and performs some computation.

Are the half-precision computation units and single-precision computation units related in any way? More specifically, is one single precision computation unit composed of two half-precision computation units?

Can you kindly point me to any documentation related to this implementation?

njuffa · March 27, 2017, 1:31pm

I am not aware of NVIDIA documentation that explains the microarchitecture to that level. However, for recent generations of NVIDIA GPUs, the wide range of relative computational throughputs suggests that FP16, FP32, and FP64 units are built as separate entities, which allows NVIDIA to compose processors with the throughput profile required for particular market segments. To be clear, this is conjecture.

It is possible to build shared units to re-use relatively expensive hardware like multiplier arrays, and various such schemes are described in the literature.

If this design decision (shared vs separate hardware for different floating-point formats) were documented, how would you take advantage of it?

Varun1312 · March 27, 2017, 1:57pm

I was just curious about the implementation as the performance of FP16 is exactly twice that of FP32.

While having separate computational units for FP16, FP32 and FP64 is a possible option, it increases the silicon area.

The other option is to share the units, but that may not result in exact 2x scaling.

njuffa · March 27, 2017, 2:14pm

The design philosophy NVIDIA seems to use is to build a base configuration with full FP32 performance (which is needed for 3D graphics, their bread & butter business), then bolt on additional units (FP16, FP64) for professional markets, where the additional revenue per part (thousands of dollars) more than makes up for higher die costs (hundreds of dollars).

This approach allows their consumer line to compete on price, while allowing their professional line to compete on performance and features.

Again, conjecture on my part.

SPWorley · March 27, 2017, 7:58pm

As njuffa points out, the actual hardware implementation is pretty much unimportant to us for programming, since it's all abstracted away. And NVIDIA rarely gives much detail.

But, for your very specific question of whether GP100’s FP16 and FP32 ALUs are shared in the same hardware sub unit, the GP100 whitepaper (surprisingly) does actually answer that exact question: “One new capability that has been added to GP100’s FP32 CUDA Cores is the ability to process both 16-bit and 32-bit precision instructions and data.”
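As a rough sanity check on the exact 2x ratio, the peak numbers from the first post can be reproduced with simple arithmetic. This sketch assumes the publicly listed Tesla P100 figures of 3584 FP32 CUDA cores and a 1480 MHz boost clock (neither is stated in this thread), counts a fused multiply-add as two floating-point operations, and assumes each FP32 core handles two packed FP16 values per instruction. The function name is made up for illustration.

```python
CUDA_CORES = 3584        # FP32 CUDA cores on Tesla P100 (assumed from public specs)
BOOST_CLOCK_HZ = 1.48e9  # boost clock in Hz (assumed from public specs)
FLOPS_PER_FMA = 2        # one fused multiply-add counts as two operations

def peak_tflops(values_per_core_per_clock: int) -> float:
    """Peak throughput if every core issues one FMA per clock on N values."""
    ops = CUDA_CORES * BOOST_CLOCK_HZ * FLOPS_PER_FMA * values_per_core_per_clock
    return ops / 1e12

fp32 = peak_tflops(1)  # one FP32 value per core per clock
fp16 = peak_tflops(2)  # two packed FP16 values per core per clock
print(f"FP32 {fp32:.1f} TFLOPS, FP16 {fp16:.1f} TFLOPS")
```

Under those assumptions the two results come out at about 10.6 and 21.2 TFLOPS, which is consistent with the same cores processing FP16 in pairs rather than with a separate bank of FP16 units.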

njuffa · March 27, 2017, 8:00pm

Thanks for the pointer. I stand corrected.

Varun1312 · March 28, 2017, 6:30am

Thanks for the inputs.
