RTX 5090 Peak BF16 Tensor TFLOPS

I noticed the spec says the BF16 throughput is 209 TFLOPS. Could you please explain what “acc fp32” means here?

Does this correspond to the throughput of the MMA instruction SM80_16x8x8_F32BF16BF16F32_TN? If so, would the throughput of the MMA instruction SM80_16x8x16_F16F16F16F16_TN be 419 TFLOPS?

16-bit tensor core paths generally offer the option to accumulate the vector dot-product intermediate results at either 16-bit or 32-bit precision. “acc fp32” means FP32 accumulate: each individual multiplication takes two 16-bit quantities and produces a 32-bit result, which is then added to a 32-bit accumulator.
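If it helps, here is a minimal sketch (my own illustration, not from the spec) of what that looks like at the PTX level. Note how the instruction name spells out the types in D, A, B, C order: `f32.bf16.bf16.f32` means BF16 inputs with an FP32 accumulator.

```cuda
// Minimal sketch: one warp issuing the m16n8k8 BF16 MMA with FP32 accumulate
// via inline PTX. Requires compiling for sm_80 or newer. The fragments are
// assumed to already be in the per-thread layout the PTX ISA defines for
// this shape.
__device__ void mma_bf16_f32acc(const unsigned (&a)[2],  // 4 bf16 values packed in 2x .b32
                                unsigned b,              // 2 bf16 values packed in 1x .b32
                                float (&d)[4])           // 4 fp32 accumulators (C and D)
{
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%0,%1,%2,%3};\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(b));
}
```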

I’m not familiar with those MMA instructions either at the PTX level or the SASS level, but the accumulator type of a PTX MMA instruction can be determined from the PTX guide.

I don’t happen to know offhand which spec you are referencing here; it’s usually helpful to others to provide a link in such cases. My guess is that it is here.

When I looked through the PTX guide that I linked above, I did not find any MMA instructions operating on BF16 that had anything other than FP32 accumulate.
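For example, as I read the mma section of the PTX ISA guide, the relevant type combinations include:

```
// BF16 inputs: FP32 accumulate is the only option
mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32

// FP16 inputs: either FP16 or FP32 accumulate
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
```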

The other instruction you reference, SM80_16x8x16_F16F16F16F16_TN, does not appear to be BF16. So for that one I would look for a specification for FP16 with FP16 accumulate. Using the spec link I previously gave, the stated peak theoretical throughput for FP16 (not BF16) with FP16 accumulate is indeed 419 TFLOPS (dense, Table 3). But you don’t need to extrapolate from the BF16 number you circled; the FP16 figure is stated directly in the table.
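As a back-of-envelope consistency check (my own arithmetic; the 680 tensor cores and ~2.407 GHz boost clock are the commonly published RTX 5090 figures, not something stated in this thread):

```
419e12 / (680 tensor cores × 2.407e9 clocks/s) ≈ 256 FP16 ops/clock per tensor core
419 / 2 ≈ 209.5 TFLOPS → the BF16 "acc fp32" figure, i.e. FP32 accumulate at half rate
```

That 2:1 ratio between the FP16-accumulate and FP32-accumulate rows is visible directly in the table, which is why the two numbers you are comparing differ by a factor of two.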