So it seems that they are equal. My question is about the performance of multiplying in FP16 and accumulating in FP32. Is it the same as the FP32 peak performance? (I expected FP16 with FP16 accumulation to sometimes double the performance of FP16 with FP32 accumulation.)
The only place I know of where that happens is in the Tensor Core (TC) units, which are used for matrix-matrix multiplication. So for TC, and specifically for matrix-matrix multiplication, the throughput is not the same as either the FP16 or FP32 throughput. Based on the whitepaper, the peak theoretical TC throughput for the FP16-multiply/FP32-accumulate path should be around 70 TFLOPS (for the RTX 3090).
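For reference, the FP16-input/FP32-accumulate Tensor Core path being discussed is exposed directly in CUDA through the WMMA API: the input fragments are `half` while the accumulator fragment is `float`. A minimal sketch (one warp computing a single 16x16x16 tile; swapping the accumulator type to `half` selects the FP16-accumulate path instead):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a 16x16x16 tile.
// Inputs are FP16; the accumulator fragment is FP32.
__global__ void wmma_fp16_fp32acc(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;  // FP32 accumulate

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

This is only a sketch of the API shape, not a benchmark; measuring the two accumulation paths would require timing full GEMMs (e.g. via cuBLAS with `CUBLAS_COMPUTE_16F` vs `CUBLAS_COMPUTE_32F`) at sizes large enough to saturate the TC units.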
For TC matrix-matrix multiply, that is correct, and it is also covered in the whitepaper (e.g. Table 2).