CUDA Compute Profiling results: What is the profiling mechanism of GPU performance counters?

Hi,
Can anyone explain how performance counters work on the GPU? Does each SM have its own set of performance counters?
I ask because I want to know the number of instructions executed by ONE specific block (a work-group in OpenCL).
In some cases, the instruction count reported for a kernel launched with 1 block is no different from the count for the same kernel launched with 4 blocks. My guess is that the result is sampled on one SM and then multiplied by the number of SMs. Is that true?
If so, assume there are 16 SMs. If all blocks do similar work (not much variance in the number of instructions), then the result for a kernel launched with 1 block should be the same as for 16 blocks, right? And if so, what happens when the kernel is launched with 17 blocks?
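
To make my guess concrete, here is a small host-side C++ sketch of the model I have in mind. It is only a hypothesis, not documented profiler behaviour: it assumes blocks are handed out to SMs round-robin starting at SM 0, that the counters are read from SM 0 only, and that every block executes the same number of instructions. The function names and the `instr_per_block` parameter are made up for illustration. Scaling the result by the number of SMs would multiply every reading by the same constant, so I left it out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (an assumption, not the documented mechanism):
// blocks are dealt to SMs round-robin starting at SM 0, and the
// profiler only reads the counters of SM 0.
std::uint64_t blocks_on_sampled_sm(std::uint64_t num_blocks,
                                   std::uint64_t num_sms) {
    assert(num_sms > 0);
    // SM 0 gets blocks 0, num_sms, 2*num_sms, ...
    // which is ceil(num_blocks / num_sms) blocks in total.
    return (num_blocks + num_sms - 1) / num_sms;
}

// Instruction count the profiler would see on SM 0 under this model,
// assuming every block executes instr_per_block instructions.
std::uint64_t sampled_instructions(std::uint64_t num_blocks,
                                   std::uint64_t num_sms,
                                   std::uint64_t instr_per_block) {
    return blocks_on_sampled_sm(num_blocks, num_sms) * instr_per_block;
}
```

Under these assumptions, with 16 SMs, `blocks_on_sampled_sm` gives 1 for both 1 block and 16 blocks, so the reported count would be identical, while 17 blocks puts a second block on SM 0 and the reading would double. If the real scheduler or counter placement differs, these numbers change, which is exactly what I am trying to find out.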

Thanks,
Tuan

Any help?