BF16 only reaches half of FP16 TFLOPS with the same __hfma2 instruction on H100

I saw in the H100 whitepaper that BF16 and FP16 have the same vector (non-Tensor-Core) TFLOPS.
But when I use the __hfma2 intrinsic for both precisions, FP16 reaches the peak TFLOPS while BF16 only reaches about half of the FP16 number.
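For context, here is a minimal sketch of the kind of microbenchmark I mean (the kernel name, launch configuration, and timing harness below are illustrative, not my exact code): a templated kernel that runs the same __hfma2 chains for __half2 and __nv_bfloat162 and reports the achieved TFLOPS.

```cuda
// Sketch of a throughput microbenchmark for packed half-precision FMAs.
// Each thread runs four independent __hfma2 chains so the FMA pipe is
// throughput-bound rather than latency- or memory-bound; one packed
// __hfma2 counts as 4 FLOPs (2 lanes x (multiply + add)).
// Build (illustrative): nvcc -O3 -arch=sm_90 hfma2_bench.cu -o hfma2_bench
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

constexpr int kIters = 4096;

template <typename T2>
__global__ void fma2_throughput(T2 *out, T2 a, T2 b)
{
    T2 acc0 = a, acc1 = b, acc2 = a, acc3 = b;
    #pragma unroll 32
    for (int i = 0; i < kIters; ++i) {
        acc0 = __hfma2(acc0, a, b);   // resolves to the __half2 or
        acc1 = __hfma2(acc1, a, b);   // __nv_bfloat162 overload of __hfma2
        acc2 = __hfma2(acc2, a, b);
        acc3 = __hfma2(acc3, a, b);
    }
    // Store a combination of the accumulators so nothing is optimized away.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hfma2(__hfma2(acc0, acc1, acc2), acc3, b);
}

template <typename T2>
double run_tflops(T2 a, T2 b, int blocks, int threads)
{
    T2 *out;
    cudaMalloc(&out, sizeof(T2) * blocks * threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fma2_throughput<<<blocks, threads>>>(out, a, b);   // warm-up
    cudaEventRecord(start);
    fma2_throughput<<<blocks, threads>>>(out, a, b);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);

    // 4 chains * kIters iterations * 4 FLOPs per packed FMA, per thread.
    double flops = 4.0 * kIters * 4.0 * double(blocks) * double(threads);
    return flops / (ms * 1e-3) / 1e12;                 // TFLOPS
}

int main()
{
    const int blocks = 1 << 14, threads = 256;
    // a = 0.5, b = 0.25 keeps the accumulators bounded (no overflow to inf).
    printf("FP16 __hfma2: %.2f TFLOPS\n",
           run_tflops(__floats2half2_rn(0.5f, 0.5f),
                      __floats2half2_rn(0.25f, 0.25f), blocks, threads));
    printf("BF16 __hfma2: %.2f TFLOPS\n",
           run_tflops(__floats2bfloat162_rn(0.5f, 0.5f),
                      __floats2bfloat162_rn(0.25f, 0.25f), blocks, threads));
    return 0;
}
```

The only intended difference between the two runs is the packed type (__half2 vs __nv_bfloat162); the FP16 run gets close to the whitepaper vector peak, the BF16 run only about half of it.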
Any insights into how I can reach the maximum BF16 TFLOPS?