I saw in the H100 whitepaper that bf16 and fp16 have the same vector (non-tensor-core) TFLOPS.
When I use the __hfma2 intrinsic for both precisions, fp16 achieves the peak TFLOPS, but bf16 reaches only half of the fp16 number.
Any insights into how I can get the peak bf16 TFLOPS?
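For reference, a minimal sketch of the kind of FLOP-bound loop I'm timing, assuming the `__hfma2` overloads from `cuda_fp16.h` and `cuda_bf16.h` (kernel names, constants, and the iteration scheme are placeholders, not my exact benchmark):

```
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Each __hfma2 performs 2 element-wise FMAs = 4 FLOPs.
// Two dependent chains per iteration keep the FMA pipe busy.

__global__ void fma_fp16(__half2 *out, int iters) {
    __half2 a = __float2half2_rn(1.0001f);
    __half2 b = __float2half2_rn(0.9999f);
    __half2 c = __float2half2_rn(0.0f);
    for (int i = 0; i < iters; ++i) {
        c = __hfma2(a, b, c);
        b = __hfma2(a, c, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;  // keep the result live
}

// Same loop with the bf16 overload of __hfma2 (needs sm_80+, fine on H100).
__global__ void fma_bf16(__nv_bfloat162 *out, int iters) {
    __nv_bfloat162 a = __float2bfloat162_rn(1.0001f);
    __nv_bfloat162 b = __float2bfloat162_rn(0.9999f);
    __nv_bfloat162 c = __float2bfloat162_rn(0.0f);
    for (int i = 0; i < iters; ++i) {
        c = __hfma2(a, b, c);
        b = __hfma2(a, c, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}
```

With fp16 this kind of loop reaches the advertised vector rate for me; the bf16 version runs at roughly half of it, which is the gap I'm asking about.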