Single thread benchmarking with clock64()

I recently read this paper:

on the additional cost of thread divergence in loops.
The clock64() function was used to measure processing times like this
(with some simplifications; tid is the thread index, and M may be different for each thread):

int M = limits[tid];
long long start = clock64();
for (int i = 0; i < M; i++)
    sum += EXPR_INNER;
long long stop = clock64();
timer[2 * tid] = start;
timer[2 * tid + 1] = stop;
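For reference, a complete, compilable version of this fragment might look as follows. This is only a sketch: `EXPR_INNER` is replaced with a placeholder computation, and `limits`, `timer`, and `out` are hypothetical device arrays.

```cuda
// Minimal self-contained sketch of the timing kernel above.
// Writing `sum` to global memory keeps the compiler from
// eliminating the loop as dead code.
__global__ void timedLoop(const int *limits, long long *timer, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int M = limits[tid];              // per-thread trip count -> divergence
    float sum = 0.0f;

    long long start = clock64();
    for (int i = 0; i < M; i++)
        sum += i * 0.5f;              // placeholder for EXPR_INNER
    long long stop = clock64();

    timer[2 * tid]     = start;
    timer[2 * tid + 1] = stop;
    out[tid] = sum;                   // consume sum so the loop survives -O3
}
```

Storing `start`, `stop`, and `sum` to global memory creates data dependencies that discourage the compiler from reordering or removing the measured code, though as noted below this should still be verified in the SASS output.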

The question is: if we consider just the threads inside one warp, will stop contain different values for each thread, or the same value?
If all the threads follow the loop to the very end, even when some of them are no longer doing anything, then they should reach the clock64() line at the very same moment…

Yes, correct. All threads in a warp should read the same value for clock64() in line 5 of your code. Did you try it?

The warp will keep looping in statements 3 and 4 until all threads have reached their M limit. Then all threads will reach line 5 and all threads will receive the same clock64() value in their own “stop” variable.

But even that is not guaranteed, just likely. Be careful with clock() and clock64(), since the compiler may reorder instructions and you may not be measuring exactly before and after the loop. Check the SASS assembly output (e.g. with cuobjdump -sass) to make sure the clock() calls have indeed properly bracketed the loop.

Finally, remember the change in clock64() is not “the number of clocks the warp took to execute the loop.” It’s the number of clocks between the times the multiprocessor queried the clock counters, which includes ALL the multiprocessor’s blocks and warps. It’s still useful for both throughput and latency timing, you just have to be careful.

I did some experiments with somewhat more complicated code and more branching, mostly ifs.
To visualise the results I plotted two values for each thread separately: red points indicate the start clock value and blue points the stop clock value.
You can see the first 200 threads clearly divided into several warps.
I am surprised that threads in a single warp get different stop readings.
Is this the mentioned effect of compiler optimizations?

Whether all threads in a warp receive the same value for stop depends on where ptxas chooses to place the synchronization point. If, for whatever reason, it places the synchronization point after the second clock64() call, results may differ per thread.

Can you post your full code for the kernel? What compute capability are you running on?

Sure, I know that the results between warps will be different; the question is why the measurements within a single warp differ between threads.

Unfortunately I cannot paste my exact code here, but it is based on the sample above. The only difference is that the loop between the measurements is more complicated and performs several memory operations.

This was run on a Tesla C2070M with compute capability 2.0, using CUDA 7.5.