I recently read a paper on the additional cost of thread divergence in loops.
The clock64() function was used to measure processing times like this (with some simplifications; note that M may be different for each thread):
    int M = limits[threadIdx.x];      // per-thread iteration count

    long long start = clock64();      // clock64() returns a long long
    for (int i = 0; i < M; i++)
        sum += EXPR_INNER;
    long long stop = clock64();

    timer[2 * tid]     = start;
    timer[2 * tid + 1] = stop;
The question is: if we consider only the threads inside one warp, will stop contain a different value for each thread, or the same value for all of them?
If all the threads follow the loop to the very end, even when some of them are not doing anything, then they should all reach the second clock64() call at the very same moment…
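
One way to check this empirically would be a small probe like the sketch below. It is only an illustration under my own assumptions, not the code from the paper: the kernel name divergence_probe, the helper run_probe, the choice of one warp of 32 threads, the limits 100 * (t + 1), and the sinf call standing in for EXPR_INNER are all hypothetical. Error checking is left out for brevity, and the result may well depend on the GPU architecture and compiler version.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical probe kernel: every thread loops a different number of times
// and records clock64() immediately before and after its loop.
__global__ void divergence_probe(const int *limits, long long *timer, float *out)
{
    int tid = threadIdx.x;            // single block, so threadIdx.x is unique
    int M = limits[tid];              // per-thread iteration count
    float sum = 0.0f;

    long long start = clock64();
    for (int i = 0; i < M; i++)
        sum += sinf(sum + i);         // assumed stand-in for EXPR_INNER
    long long stop = clock64();

    timer[2 * tid]     = start;
    timer[2 * tid + 1] = stop;
    out[tid] = sum;                   // store the result so the loop is not optimized away
}

// Launches one warp and returns the interleaved {start, stop} pairs.
std::vector<long long> run_probe()
{
    const int n = 32;                 // one warp (assumed warp size of 32)
    std::vector<int> limits(n);
    for (int t = 0; t < n; t++)
        limits[t] = 100 * (t + 1);    // arbitrary, different for each thread

    int *d_limits = nullptr;
    long long *d_timer = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_limits, n * sizeof(int));
    cudaMalloc(&d_timer, 2 * n * sizeof(long long));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_limits, limits.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    divergence_probe<<<1, n>>>(d_limits, d_timer, d_out);

    std::vector<long long> timer(2 * n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(timer.data(), d_timer, 2 * n * sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_limits);
    cudaFree(d_timer);
    cudaFree(d_out);
    return timer;
}

int main()
{
    std::vector<long long> timer = run_probe();
    long long t0 = timer[0];          // print times relative to thread 0's start
    for (int t = 0; t < 32; t++)
        printf("thread %2d: start=%lld stop=%lld\n",
               t, timer[2 * t] - t0, timer[2 * t + 1] - t0);
    return 0;
}
```

If the printed stop values are identical across the warp, the threads reached the second clock64() together; if they grow with M, each thread read the clock as soon as its own loop ended.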