ILP efficiency on compute capability 2.1

I have a GTX 560 Ti, which is a compute capability 2.1 device with 8 SMs of 48 cores each.
From various articles I understand that 16 of the 48 cores in each SM are used in a superscalar fashion, exploiting instruction-level parallelism (ILP).
I would like to quantify how many instructions are scheduled and executed on these additional 16 cores.

Is there any way to do it?

I saw that the profiler has a set of counters called “instructions issued1_0”, “instructions issued1_1”, etc., and also “thread inst executed_0”, whose descriptions mention “pipeline0”, “pipeline1”, etc. What do these counters measure? Are they related to what I am looking for?
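
Independently of the profiler counters, one rough way to probe the effect empirically is a microbenchmark that runs the same arithmetic with one dependent chain per thread and then with two independent chains per thread. The sketch below is only an illustration under stated assumptions: the kernel name `fma_chains`, the helper `ops_per_ms`, and the launch configuration (64 blocks of 512 threads, 65536 iterations) are hypothetical choices, not anything taken from the profiler or from NVIDIA documentation, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ILP = number of independent multiply-add chains carried by each thread.
// (Hypothetical kernel, written only to illustrate the measurement idea.)
template <int ILP>
__global__ void fma_chains(float *out, float a, float b, int iters)
{
    float acc[ILP];
#pragma unroll
    for (int k = 0; k < ILP; ++k)
        acc[k] = (float)(threadIdx.x + k);   // distinct starting values

    for (int i = 0; i < iters; ++i) {
#pragma unroll
        for (int k = 0; k < ILP; ++k)
            acc[k] = acc[k] * a + b;         // chain k never reads chain j
    }

    float sum = 0.0f;
#pragma unroll
    for (int k = 0; k < ILP; ++k)
        sum += acc[k];
    // Store the result so the compiler cannot discard the arithmetic.
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Runs the kernel once as a warm-up, then times a second launch with CUDA
// events and returns multiply-add operations per millisecond.
template <int ILP>
double ops_per_ms(float *d_out, int blocks, int threads, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fma_chains<ILP><<<blocks, threads>>>(d_out, 1.0001f, 0.5f, iters);

    cudaEventRecord(start);
    fma_chains<ILP><<<blocks, threads>>>(d_out, 1.0001f, 0.5f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (double)blocks * threads * iters * ILP / ms;
}

int main()
{
    // Assumed launch configuration: enough warps to keep all 8 SMs busy.
    const int blocks = 64, threads = 512, iters = 1 << 16;

    float *d_out = 0;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    double r1 = ops_per_ms<1>(d_out, blocks, threads, iters);
    double r2 = ops_per_ms<2>(d_out, blocks, threads, iters);

    printf("ILP=1: %.3e ops/ms\n", r1);
    printf("ILP=2: %.3e ops/ms\n", r2);
    printf("ratio ILP=2 / ILP=1: %.2f\n", r2 / r1);

    cudaFree(d_out);
    return 0;
}
```

If the ILP=2 variant reaches a noticeably higher operation rate than the ILP=1 variant, that would suggest the extra cores are only being fed when independent instructions are available. The ratio is an indirect indication rather than a count of instructions issued to those cores, and it depends on the compiler keeping the two chains independent, so it may be worth inspecting the generated code as well.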