The GP100 GPU, based on the Pascal architecture, delivers 10.6 TFLOPS of FP32 performance and 21.2 TFLOPS of FP16 performance. The representations of FP16 and FP32 numbers are quite different, i.e. the same number has different bit patterns in FP32 and FP16 (unlike integers, where a 16-bit integer has the same bit pattern in a 32-bit representation apart from the leading zeros).
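To make the representation difference concrete, here is a small CUDA sketch of my own (compiled with nvcc, using the conversion and bit-reinterpretation intrinsics from cuda_fp16.h) that prints the bit patterns of the same value, 1.5, in FP32 and FP16:

```cuda
// My own illustration: the same value 1.5 has different bit patterns in FP32 and FP16.
// FP32: 1 sign bit, 8-bit exponent (bias 127), 23-bit mantissa.
// FP16: 1 sign bit, 5-bit exponent (bias 15), 10-bit mantissa.
#include <cstdio>
#include <cuda_fp16.h>

__global__ void show_bit_patterns()
{
    float  f32 = 1.5f;
    __half f16 = __float2half(f32);

    // Reinterpret the raw bits without changing them.
    unsigned int   bits32 = __float_as_uint(f32);
    unsigned short bits16 = __half_as_ushort(f16);

    printf("1.5 as FP32: 0x%08X\n", bits32);  // prints 0x3FC00000
    printf("1.5 as FP16: 0x%04X\n", bits16);  // prints 0x3E00
}

int main()
{
    show_bit_patterns<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

So 1.5 is 0x3FC00000 in FP32 but 0x3E00 in FP16, because the exponent widths and biases differ, whereas the integer 3 is simply 0x00000003 versus 0x0003.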
How are the floating-point units in GP100 implemented so that nearly double the throughput is achieved by moving from FP32 to FP16?
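For context, here is a minimal CUDA sketch of the kind of packed FP16 arithmetic I have in mind (the kernels and sizes are just placeholders of mine; half2 and __hfma2 are the real cuda_fp16.h vector type and intrinsic). My assumption is that each packed instruction operates on two FP16 values at once, which is where the near-2x figure would come from, but I would like to understand how the units themselves are built:

```cuda
// Sketch under my assumption: one __hfma2 instruction performs two half-precision
// fused multiply-adds, versus one FP32 fused multiply-add per instruction.
#include <cuda_fp16.h>

// AXPY over pairs of FP16 values packed into half2 (n2 = number of half2 elements).
__global__ void axpy_fp16x2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // y[i] = a * x[i] + y[i], computed on two FP16 values at once.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}

// The same operation in FP32, one element per fused multiply-add.
__global__ void axpy_fp32(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = fmaf(a, x[i], y[i]);
    }
}
```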