Question Regarding Roofline and Operational Intensity

I am analyzing the Roofline performance of a linear layer operator.

The dimensions of the operation are as follows:
the input is (96, 198, 768), the weights are (768, 3072), and the output is (96, 198, 3072).

Theoretically,
(1) the computation should require 96 x 198 x 768 x 3072 x 2 (one multiply and one add per MAC) = 89,690,996,736 FLOPs.
(2) Accounting for weight reuse, the theoretical memory traffic would be 96 x 198 x 768 x 2 bytes (input) + 96 x 198 x 3072 x 2 bytes (output) + 768 x 3072 x 2 bytes (weights) = 150,700,032 bytes, assuming FP16 (2 bytes per element), giving a theoretical operational intensity of about 595 FLOPs/byte.
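
In code form (a quick sketch of the same arithmetic, assuming FP16 tensors at 2 bytes per element):

```python
# Theoretical FLOPs, memory traffic, and operational intensity for the
# linear layer (96, 198, 768) x (768, 3072) -> (96, 198, 3072).
B, S, C_in, C_out = 96, 198, 768, 3072
bytes_per_elem = 2  # FP16 assumption

flops = B * S * C_in * C_out * 2                                          # multiply + add per MAC
traffic = (B * S * C_in + B * S * C_out + C_in * C_out) * bytes_per_elem  # input + output + weights

print(flops)            # 89690996736
print(traffic)          # 150700032
print(flops / traffic)  # ~595.2 FLOPs/byte
```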

However, when I analyzed this kernel with ncu, the reported operational intensity was only 0.57 FLOPs/byte.

I have also gathered the following data:

Number of FFMA instructions executed: 58,982,400

It seems that the operational intensity of 0.57 comes from the calculation
(58,982,400 FFMA x 2 FLOPs) / (93.57 MB + 102.01 MB of memory traffic) = 0.57521. Is my understanding correct?
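
Written out as a sketch (my assumption is that 93.57 and 102.01 are the DRAM read and write traffic ncu reports, in MB, and that 1 MB here means 2^20 bytes, since that interpretation reproduces 0.575):

```python
# Back-calculating the reported operational intensity.
# Assumption: 93.57 and 102.01 are the DRAM read and write traffic in MB,
# with 1 MB = 2**20 bytes (the interpretation that reproduces 0.575).
ffma_count = 58_982_400
flops = ffma_count * 2                    # one multiply + one add per FFMA

mem_bytes = (93.57 + 102.01) * 2**20      # reported read + write traffic

print(flops / mem_bytes)                  # ~0.575 FLOPs/byte
```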

Could you please help me understand how the kernel’s computation is completed with just 58,982,400 FFMA instructions? Additionally, why is the operational intensity reported as 0.57? What are the rules or factors affecting these statistics?

The roofline chart in the screenshot only counts FP32 instructions (FFMA, FADD, FMUL).

The cutlass kernel listed in the command-line image is using Tensor Core instructions (HMMA). The ReLU is likely implemented in FP32, so the only work captured by this roofline is the ReLU.
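
As a rough sanity check (my estimate, not a value the tool reports directly): if the FP32 work is just a per-element epilogue over the output, you would expect roughly one FP32 instruction per output element, which is close to the FFMA count you gathered:

```python
# Rough sanity check (assumption, not a tool-reported value): a per-element
# FP32 epilogue over the output would issue roughly one instruction per
# output element, possibly rounded up by tile padding.
output_elems = 96 * 198 * 3072       # 58,392,576 output elements
ffma_reported = 58_982_400           # FFMA count from ncu
print(ffma_reported / output_elems)  # ~1.01
```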

Nsight Compute version 2024.3.0 (included in CUDA 12.6) supports rooflines for Tensor operations. Please make sure to select roofline, or enable collection of the GPU Speed Of Light Hierarchical Roofline Chart (Tensor Core) section.
