Why is Ampere's theoretical peak FP16 throughput 4x its peak FP32 throughput?

Ampere's theoretical peak FP32 throughput is 19.5 TFLOPS, while its peak FP16 throughput is 78 TFLOPS.
Why is FP16 performance four times that of FP32, rather than just twice?
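For reference, the FP32 figure falls out of the published A100 specs (108 SMs, 64 FP32 units per SM, a ~1.41 GHz boost clock, 2 FLOPs per fused multiply-add); the per-clock FP16 multiplier of 4 below is inferred from the quoted 78 TFLOPS, not from any documented microarchitectural statement:

```python
# Rough peak-throughput arithmetic for the A100 (Ampere).
# Published figures: 108 SMs, 64 FP32 units per SM, ~1.41 GHz boost clock.
SMS = 108
FP32_UNITS_PER_SM = 64
BOOST_CLOCK_HZ = 1.41e9
FLOPS_PER_FMA = 2  # a fused multiply-add counts as two FLOPs

fp32_peak = SMS * FP32_UNITS_PER_SM * FLOPS_PER_FMA * BOOST_CLOCK_HZ
print(f"FP32 peak: {fp32_peak / 1e12:.1f} TFLOPS")  # ~19.5

# The datasheet's 78 TFLOPS FP16 figure implies 4 FP16 FLOPs per unit
# per clock -- the 4x ratio this question is about.
FP16_MULTIPLIER = 4
fp16_peak = fp32_peak * FP16_MULTIPLIER
print(f"FP16 peak: {fp16_peak / 1e12:.1f} TFLOPS")  # ~78.0
```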

Companies usually don’t explain the rationale underlying processor design decisions in public or in detail. NVIDIA certainly has no history of doing so, so we are left to speculate (hopefully intelligently) about it.

To first order, throughput ratios between different types of arithmetic can be freely chosen in a given processor design. In an industrial for-profit environment, design decisions are influenced by the desire to maximize product sales volume (and ultimately, profits), which requires careful consideration of target markets.

At present, AI makes up a sizeable portion of the target market for NVIDIA GPUs. It is well known that current AI technology requires high-throughput low-precision arithmetic. A plausible hypothesis therefore is that a decision was made to over-proportionally increase FP16 throughput compared to prior architectures to satisfy those market requirements, with the goal of solidifying NVIDIA’s role as a key player in AI.
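As a concrete illustration of that low-precision arithmetic, CUDA exposes packed-FP16 intrinsics: a `__half2` holds two 16-bit values in one 32-bit register, so a single `__hfma2` instruction performs two FP16 fused multiply-adds. This sketch shows the programming-model side of doubled FP16 throughput per instruction; it is not a claim about how Ampere reaches the full 4x ratio:

```cuda
#include <cuda_fp16.h>

// One __hfma2 performs two FP16 fused multiply-adds at once, since a
// __half2 packs two 16-bit values into a single 32-bit register.
__global__ void fp16_axpy(const __half2 *x, const __half2 *y,
                          __half2 *out, __half2 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a, x[i], y[i]);  // out = a * x + y, two FP16 lanes per call
}
```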

Thanks njuffa. I was thinking that peak FP16 throughput being 4x peak FP32 throughput would surprise many GPU programmers, and that it should be basic knowledge; however, I can't find any material explaining it… Just curious what the magic behind it is…

No magic, just hard work, I would think. Combine a cutting-edge silicon manufacturing process, highly optimized but relatively simple cores, and a large die size, and processor architects get to lay down thousands of these cores.

The bigger challenge is presumably the communication infrastructure for control and data, and the memory subsystem needed to keep so many cores well fed.