How to estimate average cycles per instruction?

I want to see whether performance is bound by instruction efficiency, i.e. whether the average number of cycles needed per instruction is close to 1.
Here is how I am thinking of doing this for a single kernel:
instruction efficiency = (# of threads) x (instructions/thread) / (instructions/cycle)

The first term (# of threads) is easy to compute.
The last term is at most 128, if enough threads occupy all 16 SMs, but I'm not sure how to measure the average effective value of this term.
The second term is the most difficult one, and I'm not sure what the right way to estimate it is. Would counting the instructions in the PTX code work, assuming the branch behaviour is fully known? Or is there a better way to do this?
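
To make the arithmetic concrete, here is a rough sketch in Python. The function names and the example numbers (thread count, instructions per thread, measured cycle count) are made up for illustration. Only the peak of 128 instructions/cycle comes from the estimate above. Note that the right-hand side of the formula works out to a cycle count, so the sketch compares it with a measured cycle count to get a ratio, where a value close to 1 would suggest the kernel is instruction bound.

```python
# Hypothetical helper names; all inputs are assumed to be known or measured.

PEAK_INSTR_PER_CYCLE = 128.0  # upper bound from the post: all 16 SMs fully busy


def ideal_cycles(num_threads, instr_per_thread, instr_per_cycle=PEAK_INSTR_PER_CYCLE):
    """Cycles needed if the kernel issued instructions at the given rate."""
    total_instructions = num_threads * instr_per_thread
    return total_instructions / instr_per_cycle


def cycle_ratio(measured_cycles, num_threads, instr_per_thread):
    """Measured cycles divided by the ideal cycle count at peak issue rate."""
    return measured_cycles / ideal_cycles(num_threads, instr_per_thread)


if __name__ == "__main__":
    # Made-up example: 65536 threads, 100 instructions each, 76800 cycles measured.
    print(ideal_cycles(65536, 100))        # ideal cycles at 128 instructions/cycle
    print(cycle_ratio(76800, 65536, 100))  # ratio of measured to ideal cycles
```

The instructions-per-thread input is still the open question: a static count from the PTX is only a guess unless the branch behaviour is known, and the final machine code may differ from the PTX.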

Thanks for any ideas.