I am analyzing the Roofline performance of a linear layer operator.
The dimensions of the operation are as follows:
the input is (96, 198, 768), the weights are (768, 3072), and the output is (96, 198, 3072).
Theoretically,
(1) the computation should require 96 x 198 x 768 x 3072 x 2 (one multiply and one add per multiply-accumulate) = 89,690,996,736 FLOPs.
(2) Assuming each tensor is moved between DRAM and the chip only once (i.e. full weight reuse), with 2 bytes per element, the theoretical memory traffic would be 96 x 198 x 768 x 2 bytes (input) + 96 x 198 x 3072 x 2 bytes (output) + 768 x 3072 x 2 bytes (weights) = 150,700,032 bytes, giving a theoretical operational intensity of about 595 FLOPs/byte.
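For reference, the arithmetic behind these theoretical numbers can be reproduced with a short script (the names B, S, C_in, C_out are just labels I chose for the batch, sequence, input-channel, and output-channel dimensions):

```python
# Theoretical Roofline numbers for the linear layer, 2 bytes per element.
B, S, C_in, C_out = 96, 198, 768, 3072

# Each output element needs C_in multiply-accumulates -> 2 * C_in FLOPs.
flops = B * S * C_in * C_out * 2

# Ideal traffic: read input and weights once, write output once.
bytes_moved = (B * S * C_in + B * S * C_out + C_in * C_out) * 2

print(flops)                # 89690996736
print(bytes_moved)          # 150700032
print(flops / bytes_moved)  # ~595.16 FLOPs/byte
```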
However, when I profiled this kernel with ncu, the reported operational intensity was only 0.57 FLOPs/byte.
I have also gathered the following data:
Number of FFMA instructions executed: 58,982,400
It seems the reported operational intensity of 0.57 comes from the calculation
(58,982,400 x 2) / ((93.57 + 102.01) x 2^20) ≈ 0.57521, i.e. FFMA FLOPs divided by DRAM traffic in bytes (the 93.57 and 102.01 appear to be read and write traffic in MiB). Is my understanding correct?
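To make the units explicit, here is my reconstruction of ncu's figure. I am assuming the 93.57 and 102.01 are DRAM read and write traffic in MiB, since that is the only interpretation that reproduces 0.57521:

```python
# Reproducing the measured operational intensity from the ncu counters.
ffma_count = 58_982_400                        # FFMA instructions executed
dram_read_mib, dram_write_mib = 93.57, 102.01  # assumed: DRAM traffic in MiB

measured_flops = ffma_count * 2                # 2 FLOPs per FFMA
dram_bytes = (dram_read_mib + dram_write_mib) * 2**20

oi = measured_flops / dram_bytes
print(round(oi, 5))  # 0.57521
```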
Could you please help me understand how the kernel completes this computation with only 58,982,400 FFMA instructions? Additionally, why is the operational intensity reported as 0.57? What rules or factors determine these statistics?