The excerpted text says:
The former number is greater than the latter since threads are time sliced.
where the former number is:
the number of clock cycles taken by the device to completely execute the thread
and the latter is:
the number of clock cycles the device actually spent executing thread instructions
Suppose we have a sub-sequence of SASS instructions like this:
LD R0, [a]          // load first multiplicand
LD R3, clock        // first clock sample
LD R1, [b]          // load second multiplicand
FMUL R2, R0, R1     // multiply
LD R4, clock        // second clock sample
ST [c], R2          // store the product
IADD R5, R4, -R3    // compute the clock delta
ST [diff], R5       // store the measured duration
We are multiplying 2 numbers which must first be fetched into registers, and based on the placement of the instructions that read the clock register, we intend to time the duration of the 2nd load instruction and the multiply instruction (or something like that).
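As an aside, here is a minimal CUDA C++ sketch of the source-level pattern that might compile to something like the above (the kernel name and the pointers a, b, c, diff are placeholders; the compiler is free to reorder, so the actual SASS would have to be verified, e.g. with cuobjdump -sass):

__global__ void timed_mul(const float *a, const float *b, float *c,
                          long long *diff)
{
    float x = *a;                 // LD R0, [a]
    long long start = clock64();  // first clock sample
    float y = *b;                 // LD R1, [b]
    float z = x * y;              // FMUL R2, R0, R1
    long long stop = clock64();   // second clock sample
    *c = z;                       // ST [c], R2
    *diff = stop - start;         // IADD + ST [diff]
}

clock64() is the CUDA device intrinsic that reads a per-multiprocessor counter incremented every clock cycle.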
Let’s also assume that there are many warps executing this instruction stream.
The warp scheduler issues the first instruction for the first warp. There are no dependencies, and a load operation does not by itself ever cause a stall, so the warp scheduler in the next cycle issues the 2nd instruction for the first warp. Again, no dependencies, no stalls, so the warp scheduler in the 3rd cycle issues the 3rd instruction for warp 0. The clock register has already been sampled by this point (in the second instruction).
Now, the warp scheduler would like to proceed, but the FMUL instruction depends on the results of the previous loads, and so we could imagine or posit a warp stall at this point for warp 0. So the warp scheduler then goes back to the first instruction and issues that for warp 1, and repeats the sequence, until stall, for warp 1. Likewise for warp 2 and so on. Somewhere after the 8th warp, but before the last warp, the stall on warp 0 unblocks/disappears.
The warp scheduler could go on to the 9th warp, issuing the first instruction, or it could go back to warp 0, and issue the FMUL instruction. What will it do? We don’t know; it’s not specified. Either option is possible. Let’s say it goes on to the 9th warp. That means the instructions already issued for warp 0 have finished their work, but the next instruction has not been issued yet. However, the clock is still running, cycle by cycle.
The warp scheduler goes on to finish issuing the first 3 instructions for each of the 16 warps, and now it goes back to warp 0 and picks up where it left off. It issues the FMUL instruction, and, noticing that there are no dependencies, in the very next cycle it goes ahead and issues the next LD instruction, which samples the clock again.
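To put rough numbers on this (every number here is an illustrative assumption from the simplified single-issue model above, not a measured value): warp 0 takes its first clock sample around cycle 1, but does not issue the FMUL until all 16 warps have issued their first 3 instructions:

#include <cstdio>

int main()
{
    // Toy single-issue model of the walkthrough above; all values are
    // illustrative assumptions, not measured behavior.
    const int num_warps        = 16;
    const int instrs_per_slice = 3;  // instructions each warp issues before stalling

    int first_sample_cycle  = 1;                              // warp 0's 2nd instruction
    int fmul_issue_cycle    = num_warps * instrs_per_slice;   // cycle 48: back to warp 0
    int second_sample_cycle = fmul_issue_cycle + 1;           // cycle 49: 2nd clock sample

    printf("measured delta: %d cycles\n",
           second_sample_cycle - first_sample_cycle);         // prints 48
    return 0;
}

The specific number doesn’t matter; the point is that most of those cycles were issue slots spent on the other 15 warps, not on the two instructions we delineated.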
We now have our two clock samples. But the difference between them is not purely due to activity in warp 0: many other warps, and many other instructions, were issued, cycle by cycle, by that warp scheduler, in between the two samples.
This demonstrates the idea of time-slicing between threads. The warp scheduler has effectively time-sliced the resources of the SM across threads belonging to 16 warps, in this time-frame. The total time measured between the two clock samples does indeed account for the time it took to finish the work associated with those two instructions, but other work got included as well, such as the first instruction for each of the other warps, which isn’t part of the instruction stream we had delineated when we carefully placed our first clock sampling point after the first instruction.
In this way, the measured time is greater than, or equal to, the time it took to process just the two instructions we had delineated - the second load instruction and the FMUL instruction.
This also highlights the difficulty associated with trying to use these facilities to measure instruction latency and nothing else. This is a difficult task on a GPU, and there are no built-in hardware monitors that tell you exactly how many cycles a particular instruction spent in a pipeline, or which clock cycles those were. In order to approach such measurements, it’s usually necessary to construct careful benchmarks that let the GPU operate as a whole, while inferring the per-instruction behavior, without directly measuring it.
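One common pattern in such benchmarks (a sketch of the general technique, not any particular group’s actual code) is to launch a single warp and time a long chain of dependent arithmetic, so that each instruction must wait for the previous result; the elapsed clocks divided by the chain length then approximate the per-instruction latency:

__global__ void dep_chain_latency(float *out, long long *cycles, float seed)
{
    float x = seed;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 1024; ++i)
        x = x * 1.000001f + 1e-7f;   // each op depends on the previous result
    long long stop = clock64();
    *out = x;                        // keep the chain live so it isn't optimized away
    *cycles = stop - start;          // divide by 1024 to estimate latency per op
}

Launched as a single warp (e.g. <<<1,32>>>), there is little else for the scheduler to time-slice in, so the delta is dominated by the dependent chain itself; even then, the overhead of the clock reads and any residual loop code still has to be accounted for, which is part of why such benchmarks require care.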
An example of such careful benchmarking design is the work done by the Citadel group (e.g. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking”). It’s a non-trivial task.