RTX 3090 Peak Performance

I see that the peak performance of the RTX 3090 for FP32 and FP16 is listed as follows:

FP16 (half) performance: 35.58 TFLOPS (1:1)
FP32 (float) performance: 35.58 TFLOPS

(Source: NVIDIA GeForce RTX 3090 Specs | TechPowerUp GPU Database)

So it seems that they are equal. My question is about the performance of multiplying in FP16 and accumulating in FP32. Is it the same as the FP32 peak performance? (I was expecting that FP16 with accumulation in FP16 would sometimes double the performance of FP16 with accumulation in FP32.)

You may wish to review the whitepaper for GA102.

The only place I know of where FP16 multiplication with FP32 accumulation happens is in the Tensor Core (TC) units, which are used for matrix-matrix multiplication. So for TC, and specifically for matrix-matrix multiplication, the throughput is not the same as either the FP16 or the FP32 throughput. Based on the whitepaper, the peak theoretical TC throughput for the FP16/FP32 path should be around 70 TFLOPS (for the RTX 3090).

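As for your expectation that FP16 accumulation can double the throughput compared to FP32 accumulation: for TC/matrix-matrix multiply, that is correct, and it is also covered in the whitepaper (e.g. Table 2).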
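
As a rough sanity check on these numbers, here is a minimal Python sketch of the usual peak-throughput arithmetic (units x operations per clock x 2 FLOPs per FMA x clock rate). The SM count (82), the 128 FP32 lanes per SM, and the 1.695 GHz boost clock are the published RTX 3090 figures. The per-SM Tensor Core FMA rates (512 with FP16 accumulate, 256 with FP32 accumulate) are my assumptions, chosen to be consistent with the dense figures in the whitepaper, so please verify them there. The function name `peak_tflops` is just an illustrative helper.

```python
# Back-of-the-envelope peak throughput for RTX 3090 (dense, no sparsity).
# Assumed inputs: see the note above; the tensor core per-SM rates in
# particular are assumptions, not values quoted in this thread.

SM_COUNT = 82            # streaming multiprocessors on RTX 3090
BOOST_CLOCK_GHZ = 1.695  # boost clock used for the spec-sheet numbers


def peak_tflops(sm_count, fma_per_clock_per_sm, clock_ghz):
    """Peak TFLOPS = SMs * FMAs/clock/SM * 2 FLOPs per FMA * clock (GHz) / 1000."""
    return sm_count * fma_per_clock_per_sm * 2 * clock_ghz / 1000.0


# Regular (non-tensor-core) FP32 path: 128 FP32 lanes per SM.
fp32 = peak_tflops(SM_COUNT, 128, BOOST_CLOCK_GHZ)

# Tensor core path, FP16 inputs (assumed per-SM FMA rates).
tc_fp16_acc = peak_tflops(SM_COUNT, 512, BOOST_CLOCK_GHZ)  # FP16 accumulate
tc_fp32_acc = peak_tflops(SM_COUNT, 256, BOOST_CLOCK_GHZ)  # FP32 accumulate

print(f"FP32 (CUDA cores):         {fp32:.2f} TFLOPS")
print(f"TC FP16 in, FP32 accum:    {tc_fp32_acc:.2f} TFLOPS")
print(f"TC FP16 in, FP16 accum:    {tc_fp16_acc:.2f} TFLOPS")
print(f"FP16-accum / FP32-accum:   {tc_fp16_acc / tc_fp32_acc:.1f}x")
```

Under those assumptions, the FP32 line reproduces the 35.58 TFLOPS spec-sheet value, the FP16/FP32-accumulate TC path comes out at roughly 71 TFLOPS, and the FP16-accumulate path is twice that. These are theoretical peaks; measured matrix-multiply throughput will be lower.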