Just a late reply to say thanks; this thread helped me get on the right track.
Here is what I observed in my case. I had:
big body of code (sprinkled with middle clock timings)
The timings I observed didn’t make much sense: all the clocks were within a few ticks of each other. So I compiled it to cubin and used the sass flag to look at the assembly, and the code matched. I was baffled. Finally, after compiling to cubin with the parameter for compute capability 5.2, I saw different assembly. It was doing this:
big body of code
sprinkled middle clock timing attempts, with start clock mixed in there
It seems that, because there were no dependencies between them, nvcc simply reordered the independent code and moved all the clock reads to the bottom.
I had to force a dependency at the top and at the bottom, as BulatZiganshin suggested. The value my code computes is j, so I had:
big body of code computing j, sprinkled with (if j) run middle clocks
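For anyone landing here later, this is roughly the shape of it as a minimal sketch, not my actual code. The names (timed_kernel, step, run_timed, Timing, kSteps) and the arithmetic are made up for illustration. The idea is that j is derived from the start clock at the top, and each later clock read sits behind a test on j. This is not guaranteed to stop reordering, so check the SASS for your own target.

```cuda
#include <cuda_runtime.h>

constexpr int kSteps = 1024;  // illustrative instruction count

// One arithmetic step; a stand-in for the real "big body of code".
__host__ __device__ inline unsigned int step(unsigned int j)
{
    return j * 1664525u + 1013904223u;
}

__global__ void timed_kernel(unsigned int seed, unsigned int *out, long long *ticks)
{
    long long start = clock64();

    // Top dependency: j is computed from the start clock. The top bit of the
    // counter is 0 in practice, so j keeps the value of seed, but the
    // compiler cannot assume that.
    unsigned int j = seed ^ static_cast<unsigned int>(
                                static_cast<unsigned long long>(start) >> 63);

    for (int i = 0; i < kSteps / 2; ++i) j = step(j);

    // Middle clock: only read if j is nonzero, so the read depends on j.
    long long mid = start;
    if (j != 0) mid = clock64();

    for (int i = 0; i < kSteps / 2; ++i) j = step(j);

    // Bottom dependency: same trick for the end clock.
    long long end = mid;
    if (j != 0) end = clock64();

    out[0] = j;               // keep the computation live
    ticks[0] = mid - start;   // first half (0 if j happened to be 0)
    ticks[1] = end - mid;     // second half (0 if j happened to be 0)
}

struct Timing {
    bool ok;
    unsigned int j;
    long long first_half;
    long long second_half;
};

// Launches a single thread and copies the result and tick counts back.
Timing run_timed(unsigned int seed)
{
    Timing t{false, 0u, 0, 0};
    unsigned int *d_out = nullptr;
    long long *d_ticks = nullptr;
    if (cudaMalloc(&d_out, sizeof(unsigned int)) != cudaSuccess) return t;
    if (cudaMalloc(&d_ticks, 2 * sizeof(long long)) != cudaSuccess) {
        cudaFree(d_out);
        return t;
    }

    timed_kernel<<<1, 1>>>(seed, d_out, d_ticks);

    // The blocking copies wait for the kernel and report any launch error.
    long long h_ticks[2] = {0, 0};
    bool ok = cudaMemcpy(&t.j, d_out, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(h_ticks, d_ticks, sizeof(h_ticks),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    cudaFree(d_ticks);

    t.ok = ok;
    t.first_half = h_ticks[0];
    t.second_half = h_ticks[1];
    return t;
}
```

Calling run_timed with any seed from main and printing the two tick counts is enough to try it. The tick counts include the cost of the tests on j and of the clock reads themselves.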
The assembly showed that this finally kept the clocks from being reordered. A downside of all this is that it’s harder to measure a single instruction. In my case I’m measuring 1024 instructions, so the overhead of the timing is minimal. I suppose the only way around this is compiling in debug mode, since that usually preserves the existing code order? Anyway, I’m back on track, and I don’t think I would have got here without this thread’s help. Thanks again!