How to retrieve performance counters for a range of CUDA code?

Hello, I am using PerfKit 4.4 on Windows 7 with a Tesla K40c and Dev Driver 385.08.
The API seems to report counter values only for the entire kernel execution.
Is there a way to instrument (or otherwise modify) the kernel so that the counters are collected for just a subset of the code inside it?