I saw in the H100 whitepaper that bf16 and fp16 have the same vector (non-tensor-core) TFLOPS.
When I use the __hfma2 intrinsic for both precisions, fp16 reaches the peak TFLOPS, but bf16 only reaches about half of the fp16 number.
Any insight into how I can get the max bf16 TFLOPS?
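For reference, a minimal sketch of the kind of microbenchmark I mean (not my exact code): long chains of packed `__hfma2` operations with a few independent accumulators for ILP, once with `__half2` and once with the `__nv_bfloat162` overload. The iteration count, accumulator count, and launch configuration are arbitrary here.

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// fp16 version: each __hfma2 on __half2 is 2 FMAs = 4 FLOPs
__global__ void hfma2_fp16(__half2 *out, int iters) {
    __half2 a  = __float2half2_rn(1.0f);
    __half2 b  = __float2half2_rn(0.5f);
    __half2 c0 = __float2half2_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    for (int i = 0; i < iters; ++i) {
        // four independent accumulator chains to hide FMA latency
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}

// bf16 version: same intrinsic name, __nv_bfloat162 overload (sm_80+)
__global__ void hfma2_bf16(__nv_bfloat162 *out, int iters) {
    __nv_bfloat162 a  = __float2bfloat162_rn(1.0f);
    __nv_bfloat162 b  = __float2bfloat162_rn(0.5f);
    __nv_bfloat162 c0 = __float2bfloat162_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    for (int i = 0; i < iters; ++i) {
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}
```

With this kind of kernel, TFLOPS is computed as (threads × iters × 4 accumulators × 4 FLOPs) / elapsed time; the fp16 kernel gets close to the whitepaper number for me, the bf16 one does not.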