I saw in the H100 whitepaper that bf16 and fp16 have the same vector (non-tensor-core) TFLOPS.
When I use the __hfma2 intrinsic for both precisions, fp16 reaches the peak TFLOPS, but bf16 only reaches about half of the fp16 number.
Any insight into how I can get the max bf16 TFLOPS?
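For reference, a minimal sketch of the kind of microbenchmark I mean (not my exact code): long chains of packed `__hfma2` operations with a few independent accumulators for ILP, once with `__half2` and once with the `__nv_bfloat162` overload. The iteration count, accumulator count, and launch configuration are arbitrary here.

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// fp16 version: each __hfma2 on __half2 is 2 FMAs = 4 FLOPs
__global__ void hfma2_fp16(__half2 *out, int iters) {
    __half2 a  = __float2half2_rn(1.0f);
    __half2 b  = __float2half2_rn(0.5f);
    __half2 c0 = __float2half2_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    for (int i = 0; i < iters; ++i) {
        // four independent accumulator chains to hide FMA latency
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}

// bf16 version: same intrinsic name, __nv_bfloat162 overload (sm_80+)
__global__ void hfma2_bf16(__nv_bfloat162 *out, int iters) {
    __nv_bfloat162 a  = __float2bfloat162_rn(1.0f);
    __nv_bfloat162 b  = __float2bfloat162_rn(0.5f);
    __nv_bfloat162 c0 = __float2bfloat162_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    for (int i = 0; i < iters; ++i) {
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}
```

With this kind of kernel, TFLOPS is computed as (threads × iters × 4 accumulators × 4 FLOPs) / elapsed time; the fp16 kernel gets close to the whitepaper number for me, the bf16 one does not.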