Question regarding latency

Hi!
I’m pretty new to CUDA programming and I’m just going through the NVIDIA manuals.
I’ve got a question regarding the latency (the number of clock cycles it takes for a warp to be ready to execute its next instruction).
They say:
“In order to hide a latency of L clock cycles with basic single-precision floating-point arithmetic instructions L/4 (rounded up to nearest integer) instructions are required for devices of compute capability 1.x since a multiprocessor issues one such instruction per warp over 4 clock cycles.”

What does it mean that a multiprocessor issues an instruction? Does it mean that the instruction is executed?
That’s what I first thought, but on the next page I read the following:
“If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In the case of a back-to-back register dependency (i.e., some input operand is written by the previous instruction), the latency is equal to the execution time of the previous instruction and the warp scheduler must schedule instructions for different warps during that time. Execution time varies depending on the instruction, but it is typically about 22 clock cycles, which translates to 6 warps for devices of compute capability 1.x.”

Here it states that the execution time is generally 22 clock cycles…

Can someone help me understand the exact terminology (I’m not a native English speaker)?

Thanks very much!

Here we had a little chit-chat about issuing instructions, latency and all that stuff: how many threads to hide latency

If I don’t understand the basic terminology, I can’t understand that topic either…

So the main question remains: What does it mean that a multiprocessor issues an instruction? Does it mean that the instruction is executed?

Thanks again!

The two basic terms are latency and throughput. The latency is the time it takes from starting execution until the results are available (so basically all the time it takes to execute the instruction). This is 22 to 24 cycles for most instructions in devices of compute capability 1.x.
Throughput, on the other hand, is the number of instructions that finish execution in a given amount of time. For many instructions it is one instruction every four cycles. This is faster than the latency number suggests because the execution of multiple instructions is overlapped (“pipelined”).
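
To make the two terms concrete, here is a small kernel sketch of my own (it is not from the Programming Guide; the name and constants are made up for illustration). The first group of multiply-adds is a dependent chain, so each one has to wait for the latency of the previous one; the second group is independent, so a warp can issue them at the throughput rate.

__global__ void latencyVsThroughput(float *out, float a, float b)
{
    // Dependent chain: every multiply-add reads the result of the previous
    // one, so each has to wait the full ~22-24 cycle latency of its predecessor.
    float x = a;
    x = x * b + 1.0f;
    x = x * b + 1.0f;
    x = x * b + 1.0f;

    // Independent operations: none of them reads another one's result, so
    // the warp can issue one every 4 cycles (the throughput number) without
    // waiting for earlier results.
    float p = a * b + 1.0f;
    float q = a * b + 2.0f;
    float r = a * b + 3.0f;

    out[threadIdx.x] = x + p + q + r;
}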

Issuing an instruction means that it is sent to the functional units that execute it, so issuing basically starts its execution (although by that time it has already been decoded, so the total time for processing an instruction is larger).
Finishing execution of an instruction, so that its results are available, is called “retiring”. Some processors (but no CUDA devices so far) speculatively start to execute conditional instructions before the result of the condition is available, and only retire the instruction if the condition evaluates to true. Thus the number of instructions retiring per unit of time (the throughput) may be smaller than the number of instructions starting execution.

Here is some more info for compute 1.x devices I wrote in a different thread:

Each core can start execution of one instruction of a thread per clock cycle. As a multiprocessor has only 8 cores, this takes 4 cycles for the 32 threads of a warp. So after 4 cycles, the warp can start execution of another instruction. This is the 4-cycle throughput number given in the Programming Guide.

However, the result of the operation is only known after execution of the instruction finishes, after about 22…24 (fast) clock cycles. So if the next operation wants to use this result (“depends on it”), it has to wait for 24 (instead of 4) clock cycles. Now if you have 6 warps running in round-robin fashion, each warp’s instruction results will be available just in time for that warp’s next instruction to run. This is called latency hiding, as it now appears that latency is irrelevant and only throughput determines the execution speed.
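
The arithmetic behind the 6 warps: with a latency of about 22…24 cycles and one instruction issued per warp every 4 cycles, you need roughly 24/4 = 6 warps taking turns. As a back-of-the-envelope sketch (this is my own helper, not a CUDA API function):

/* Estimate how many resident warps are needed so that, with round-robin
   issuing, each warp's previous result is ready by the time its next
   instruction comes up. */
unsigned int warpsToHideLatency(unsigned int latencyCycles,
                                unsigned int issueCyclesPerWarp)
{
    /* Round up: a 22..24 cycle latency at 4 cycles per issue gives 6 warps. */
    return (latencyCycles + issueCyclesPerWarp - 1) / issueCyclesPerWarp;
}

Both warpsToHideLatency(22, 4) and warpsToHideLatency(24, 4) give 6, which matches the 6 warps mentioned in the Guide.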

For this it does not matter whether the 6 warps are from the same thread block or from different ones. However, warps from the same block tend to stall (waiting for memory or at a __syncthreads()) at the same time (as they execute the same instruction stream and synchronize at each __syncthreads()), so there is a slight advantage if some of the warps are from a different block.
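
As an illustration only (the kernel, sizes and launch configurations below are made up), both launches put the same total number of warps on the device, but the second spreads them over smaller blocks, so the warps resident on one multiprocessor come from several blocks and are less likely to stall at exactly the same point:

#include <cuda_runtime.h>

__global__ void addOne(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    const int n = 192 * 64;              /* arbitrary problem size */
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    addOne<<<n / 192, 192>>>(d);         /* 6 warps per block */
    addOne<<<n / 64, 64>>>(d);           /* 2 warps per block, three times as many blocks */

    cudaThreadSynchronize();             /* cudaDeviceSynchronize() on newer toolkits */
    cudaFree(d);
    return 0;
}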

There is another form of “latency hiding” in the Programming Guide. Since the latency numbers are mostly irrelevant once you have at least 6 warps per SM, the Programming Guide just gives the throughput numbers and the rule to have at least 6 warps per SM. Latency is mostly hidden in the Guide as well. :)

Hope this helps.
