I have profiled an ML application with 1 epoch (https://pasteboard.co/IqNhJks.png) and with 3 epochs (https://pasteboard.co/IqNfmOJ.png). As you can see in the pictures, the profile duration for the second run is longer than for the first one, which is expected. However, for the largest kernel, the numbers of FP, INT, and other instructions remain unchanged.
That is weird, isn't it? I would expect the instruction counts to grow with the number of epochs.