# How to estimate average cycles per instruction?

I want to check whether performance is bound by instruction efficiency, i.e. whether the average number of cycles per instruction is close to 1.
For a single kernel, I’m thinking of doing it the following way: