RTX 5090 Peak BF16 Tensor TFLOPS

I noticed the spec says the BF16 throughput is 209 TFLOPS. Could you please explain what “acc fp32” means here?

Does this correspond to the throughput of the MMA instruction SM80_16x8x8_F32BF16BF16F32_TN? If so, would the throughput of the MMA instruction SM80_16x8x16_F16F16F16F16_TN be 419 TFLOPS?

16-bit tensor core paths generally offer the option to accumulate the vector dot-product intermediate results at either 16-bit or 32-bit precision. “acc fp32” means FP32 accumulate: each individual multiplication takes two 16-bit quantities and produces a 32-bit result, which is then added to a 32-bit accumulator.
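If it helps, here is a minimal sketch (my own illustration, not from the spec) of what that looks like at the PTX level. Note how the instruction name spells out the types in D, A, B, C order: `f32.bf16.bf16.f32` means BF16 inputs with an FP32 accumulator.

```cuda
// Minimal sketch: one warp issuing the m16n8k8 BF16 MMA with FP32 accumulate
// via inline PTX. Requires compiling for sm_80 or newer. The fragments are
// assumed to already be in the per-thread layout the PTX ISA defines for
// this shape.
__device__ void mma_bf16_f32acc(const unsigned (&a)[2],  // 4 bf16 values packed in 2x .b32
                                unsigned b,              // 2 bf16 values packed in 1x .b32
                                float (&d)[4])           // 4 fp32 accumulators (C and D)
{
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%0,%1,%2,%3};\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(b));
}
```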

I’m not familiar with those MMA instructions either at the PTX level or the SASS level, but the accumulator type of a PTX MMA instruction can be determined from the PTX guide.

I don’t happen to know offhand which spec you are referencing here; it’s usually helpful to others to provide a link in such cases. My guess is that it is here.

When I looked through the PTX guide that I linked above, I did not find any MMA instructions operating on BF16 that had anything other than FP32 accumulate.
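For example, as I read the mma section of the PTX ISA guide, the relevant type combinations include:

```
// BF16 inputs: FP32 accumulate is the only option
mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32

// FP16 inputs: either FP16 or FP32 accumulate
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
```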

The other instruction you reference, SM80_16x8x16_F16F16F16F16_TN, does not appear to be BF16. So for that one I would look for a specification for FP16 with FP16 accumulate. Using the spec link I previously gave, the stated peak theoretical throughput for FP16 (not BF16) with FP16 accumulate is indeed 419 TFLOPS (dense, Table 3). But you don’t need to extrapolate from the BF16 number you circled; the FP16 figure is stated directly in the table.
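As a back-of-envelope consistency check (my own arithmetic; the 680 tensor cores and ~2.407 GHz boost clock are the commonly published RTX 5090 figures, not something stated in this thread):

```
419e12 / (680 tensor cores × 2.407e9 clocks/s) ≈ 256 FP16 ops/clock per tensor core
419 / 2 ≈ 209.5 TFLOPS → the BF16 "acc fp32" figure, i.e. FP32 accumulate at half rate
```

That 2:1 ratio between the FP16-accumulate and FP32-accumulate rows is visible directly in the table, which is why the two numbers you are comparing differ by a factor of two.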